RESEARCH OVERVIEW


It  has  been  an  open  question  in  electrical  network  theory  whether  it  is 
possible  to  synthesize  a  network  with  a  prescribed  natural  frequency  (in  the 
complex  s-plane)  out  of  a  restricted  class  of  components.  It  was  possible  to 
say  since  the  1950s  that  no  natural  frequencies  can  be  obtained  in  certain 
parts  of  the  s-plane,  but  not  the  converse,  namely  that  at  other  points  a 
circuit  can  be  devised.  This  question  is  of  modern  importance  since  the 
components  in  question  may  be  MOS  transistors  and  RC  lines,  whose  accurate 
high-frequency  models  are  complicated.  It  is  important  to  know  how  fast  a 
rise  time  can  be  achieved,  or  at  what  frequencies  unwanted  oscillations  might 
occur. A set of necessary and sufficient bounds is now available, in the
sense  that  every  point  in  the  s-plane  can  be  easily  discovered  to  be  either  a 
frequency  at  which  oscillation  cannot  occur  with  any  possible  combination  of 
components  and  ideal  transformers,  or  else  a  point  at  which  a  circuit 
consisting  of  a  few  such  components  and  ideal  transformers  can  be  made  to 
oscillate. The inclusion of ideal transformers is necessary since otherwise
generally only a finite or countable number of natural frequencies can be
found.

Many  new  results  have  been  derived  for  parallel  algorithms  and  complexity. 

One  of  the  most  astonishing  is  that  a  hypercube  with  a  large  number  of  faulty 
nodes  can  be  used,  with  high  probability,  as  another  perfectly  functioning 
hypercube  of  half  the  size,  by  using  reconfiguration  algorithms  that  are 
simple,  fast,  and  require  only  local  information. 

The  design  of  a  message-driven  processor  continues.  It  is  now  being  realized 
that  many  different  highly  parallel  architectures  require  roughly  the  same 
sort  of  processing  node,  one  that  can  respond  quickly  (i.e.,  with  low  latency) 
to  messages  that  may  require  execution  of  a  few  (say  about  ten)  instructions. 
The  processor  being  designed  can  be  considered  an  experiment  in  unifying 
message-passing  and  shared-memory  architectures. 

The  waveform  bounding  work  is  continuing  at  a  slightly  slower  pace  because 
Prof.  John  Wyatt  is  at  Caltech  working  with  Carver  Mead  for  the  spring 
semester.  Nevertheless  there  is  progress  in  two  areas.  New  macromodels  have 
been  developed  for  ECL  circuits,  and  a  depth-interpolation  vision  chip  has 
been  designed  and  submitted  for  fabrication. 

The  work  on  Schema  is  winding  down,  as  students  working  directly  on  Schema 
projects  finish  up.  Prof.  Richard  Zippel,  who  was  responsible  for  Schema,  has 
now  left  MIT.  This  project  was  very  effective  as  a  bridge  among  several 
research  faculty  and  students,  and  has  in  that  sense  served  its  purpose. 

Two  of  the  projects  supported  by  this  contract  were  reported  at  the  recent 
Conference  on  Advanced  Research  in  VLSI,  held  at  Stanford  University,  March 
23-25,  1987.  These  are  an  analysis  of  multiprocessor  communication  networks, 
and  a  technique  for  powering  and  communicating  with  an  IC  chip  without  having 
any  physical  connections  such  as  wires. 
THE WAVEFORM BOUNDING APPROACH TO TIMING ANALYSIS


The  timing  analyzer  for  ECL  standard  cell  design,  mentioned  in  the  last 
report,  approaches  completion.  It  will  definitely  be  used  as  an  in-house 
design  tool  by  DEC  if  it  works  as  we  expect. 

Gates  in  the  ECL  library  are  represented  by  macromodels.  The  macromodel 
parameters  must  be  chosen  carefully  so  that  the  macromodel  waveforms  closely 
resemble  the  output  of  computation-intensive  SPICE  simulations  of  the  gates 
when loaded by fanout to other gates through interconnect. The straightforward
approach involves extensive least-squares curve fitting with iterative
variation of the macromodel parameters and is therefore also very
computation-intensive. In response to this problem, Peter O'Brien has cleverly
reformulated the macromodel so that the first two moments of the macromodel
waveforms can be calculated in advance in closed form. These explicit formulas
are then used in the parameter-fitting process, resulting in a dramatic savings
in computer time. It is possible that this type of macromodel will become more
widely used because of this advantage.

We  have  submitted  CIF  files  to  MOSIS  for  our  other  project,  the  design  of  an 
analog  depth-interpolation  chip  for  computer  vision  applications.  This  first 
design  was  more  of  a  technology-exploration  effort  than  an  attempt  to  produce 
a useful product. The idea is to use a regular planar array of linear resistors
to rapidly interpolate a 2-D array of voltages that encode a (noisy and
incomplete)  array  of  depth  estimates  for  a  3-D  scene.  A  fundamental  problem 
in  the  initial  approach  is  that  any  such  linear  interpolation  scheme  not  only 
smooths  out  noise  and  missing  data,  it  also  smooths  over  depth  boundaries 
where  actual  discontinuities  occur  in  the  scene.  Through  consultation  with 
Prof.  Carver  Mead  at  Caltech,  we  have  found  a  way  to  synthesize  a  nonlinear 
"resistor"  with  a  characteristic  of  the  form  i  =  I  x  tanh(qv/kT).  Such  a 
current-limited  "resistor"  appears  linear  for  |v|  <<  kT/q,  but  supplies  a 
maximum  current  of  magnitude  I  regardless  of  the  potential  drop  across  it. 

In  our  application  it  should  smoothly  interpolate  small  changes  in  depth,  but 
provide  clean  "breaks"  at  depth  discontinuities. 
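
The two regimes of this characteristic can be checked numerically. The following is a minimal sketch, not the chip's actual circuit model: the saturation current of 1 µA and the room-temperature thermal voltage kT/q of about 25.85 mV are assumed values chosen only for illustration.

```python
import math

# Thermal voltage kT/q near room temperature, in volts (assumed value).
THERMAL_VOLTAGE = 0.02585


def resistor_current(v, i_max=1e-6, v_t=THERMAL_VOLTAGE):
    """Current through the saturating element: i = I * tanh(v / (kT/q))."""
    return i_max * math.tanh(v / v_t)


def small_signal_conductance(i_max=1e-6, v_t=THERMAL_VOLTAGE):
    """Slope di/dv at v = 0, i.e. the conductance seen in the linear region."""
    return i_max / v_t


if __name__ == "__main__":
    # Small drops behave like a linear resistor; large drops saturate at I.
    for v in (0.001, 0.01, 0.1, 1.0):
        print(f"v = {v:6.3f} V  i = {resistor_current(v):.3e} A")
```

For drops far below kT/q the current tracks `small_signal_conductance() * v`, which gives the smoothing behavior; for drops far above kT/q the current is pinned at `i_max`, which is what lets a depth discontinuity survive.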
HIGH PERFORMANCE CIRCUIT DESIGN


Maximum Frequency of Oscillation: We have now shown that the bounds on the
maximum frequency of oscillation of linear networks discussed in the last
progress report are tight. That is, not only can we say when a set of
components cannot produce a circuit that can oscillate, but we can now say that
unless we predict that it cannot oscillate, a circuit can be found that
does oscillate. We have also developed a program on the Symbolics LISP
machine  that  will  evaluate  complex  component  models  to  see  whether  circuits 
built  of  them  can  or  cannot  oscillate.  This  program  has  been  tested  on  a 
transistor  small-signal  model  with  about  two  dozen  elements.  A  C  version  is 
under  development. 

Modeling of magnetostatic couplers for CMOS VLSI chips: We have now
demonstrated inductive power coupling into a zero-pin chip. By the use of an
on-chip bridge rectifier, voltages up to 10 V dc and powers up to 1 mW have been
successfully coupled into a bulk CMOS chip using an HP 3312A function
generator driving an external coil.
ARCHITECTURAL DESIGN

Prof.  Leighton  is  continuing  work  on  wafer-scale  integration  of  systolic 
arrays,  parallel  algorithms  and  architectures,  and  fault-tolerance.  In  the 
area  of  wafer-scale  integration,  he  and  a  student  (John  Burroughs)  are 
developing  efficient  algorithms  for  integrating  2-dimensional  arrays  on  a 
wafer containing randomly located faults. The current work extends the
theoretical work reported in the May 1986 VLSI research review by developing and
coding algorithms for cell assignment and routing that work well experimentally
for arrays of sizes 5 x 5 to 100 x 100. This work will be described
in John Burroughs's Bachelor's Thesis and a forthcoming technical report. It
is  a  nice  example  of  good  theoretical  ideas  and  asymptotic  proofs  being 
adapted  to  work  experimentally  on  realistic  problems. 

In the area of parallel algorithms and architectures, Prof. Leighton and Bruce
Maggs are developing efficient hash functions for many-one routing on a
hypercube. Although greedy algorithms have long been known to work well for
one-one routing problems on the hypercube, many-one routing problems can be much
more difficult. In fact, if the destinations of the requests are not
randomized, many-one routing problems can be intractable even if combining is
allowed. In the present work, Leighton and Maggs are developing simple hash
functions for randomizing shared memory locations so that any many-one routing
problem can be efficiently solved by a greedy type of algorithm. Many-one
routing problems on the hypercube are important since they provide the basis
for simulating a CRCW PRAM, a very general and powerful
architecture-independent model for parallel computation. One-one routing
problems, on the other hand, are only sufficient to simulate an EREW PRAM, a
much weaker model of
parallel  computation.  The  difficulty  of  many-one  routing  problems  has  been 
observed  in  many  contexts  including  the  speed  with  which  routing  can  be 
performed  on  a  Connection  Machine.  It  is  hoped  that  the  present  work  will 
lead  to  substantially  improved  routing  algorithms. 

Also  in  the  area  of  parallel  algorithms  and  architectures,  Prof.  Leighton  and 
non-MIT  coworkers  have  discovered  very  efficient  ways  of  using  the  hypercube 
to  simulate  special  purpose  architectures,  without  paying  the  usual  overhead 
due  to  routing.  For  example,  it  has  long  been  known  that  an  N-node  hypercube 
contains  any  N-node  array  of  any  dimension  as  a  subgraph.  Hence  array-based 
algorithms  can  be  (and  frequently  are)  directly  implemented  on  a  hypercube 
without any overhead. Recently, Leighton et al. have shown that any binary
tree  can  also  be  embedded  in  the  hypercube.  Binary  tree  structures  are  useful 
in  some  numerical  and  parsing  calculations  as  well  as  in  the  implementation  of 
some  divide-and-conquer  algorithms.  In  addition,  Prof.  Leighton  has  shown 
that  the  powerful  mesh  of  trees  network  is  also  a  subgraph  of  the  hypercube 
and  hence  a  large  number  of  graph  and  matrix  calculations  can  be  directly 
implemented on the hypercube. For example, the transitive closure of an
N-node graph can now be computed on an $N^2$-node hypercube in $O(\log^2 N)$ steps,
and two N x N matrices can be multiplied in $O(\log N)$ steps on an $N^3$-node
hypercube.  Although  not  difficult,  the  embedding  of  the  mesh  of  trees  in  the 
hypercube  is  nonobvious,  and  it  dramatically  increases  the  ability  of  the 
hypercube  to  perform  special  purpose  calculations. 
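
The array-in-hypercube containment mentioned above can be illustrated with the standard reflected Gray code construction. This is a sketch under a simplifying assumption: both array sides are powers of two, which is narrower than the general statement in the text. The function names are illustrative only.

```python
def gray(i):
    """Reflected binary Gray code of i; consecutive values differ in one bit."""
    return i ^ (i >> 1)


def embed_2d_array(rows_bits, cols_bits):
    """Map cell (r, c) of a 2^rows_bits x 2^cols_bits array to a hypercube node.

    The node label concatenates the Gray codes of the row and column indices.
    """
    mapping = {}
    for r in range(1 << rows_bits):
        for c in range(1 << cols_bits):
            mapping[(r, c)] = (gray(r) << cols_bits) | gray(c)
    return mapping


def is_hypercube_edge(a, b):
    """Two hypercube nodes are adjacent iff their labels differ in exactly one bit."""
    diff = a ^ b
    return diff != 0 and diff & (diff - 1) == 0


def check_embedding(rows_bits, cols_bits):
    """Verify the map is one-to-one and every array edge lands on a hypercube edge."""
    m = embed_2d_array(rows_bits, cols_bits)
    for (r, c), node in m.items():
        for nbr in ((r + 1, c), (r, c + 1)):
            if nbr in m and not is_hypercube_edge(node, m[nbr]):
                return False
    return len(set(m.values())) == len(m)
```

Because each array edge maps to a single hypercube edge, an array algorithm runs on the hypercube with no routing overhead, which is the property the paragraph above relies on.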

In the area of fault-tolerance, Johan Hastad, Prof. Leighton, and Mark Newman
are developing ways of reconfiguring a hypercube in the presence of a
potentially large number of randomly located faults. Thus far, the work has been
very promising. Among other things, Hastad, Leighton and Newman have shown
that with high probability, an N/2-node hypercube can be one-one embedded
into the live nodes of an N-node hypercube containing pN randomly located
faults (p < 1/2) so that neighboring nodes of the N/2-node hypercube are
mapped to nodes at distance 3 or less apart in the N-node hypercube. Hence a
hypercube containing a very large number of randomly located faults has
virtually the same computational power as a fully functioning hypercube! Moreover,
the algorithm for reconfiguring the hypercube is simple, fast, and works using
only local control.

James  K.  Park  has  been  working  on  a  deterministic,  on-line  message-routing 
algorithm  for  a  variant  of  the  fat-tree  that  uses  only  constant-sized 
switches. This algorithm routes arbitrary message sets M in $O(\lambda(M) \log^2 n)$
bit  steps.  The  algorithm  is  interesting  in  that  it  uses  the  network  both  for 
routing  messages  and  for  computing  how  much  congestion  there  is  in  various 
parts  of  the  network.  Charles  Leiserson  and  James  Park  are  currently  revising 
and  simplifying  the  algorithm. 

Tom Cormen continued his work on concentrator switch designs, showing that a
switch that $\epsilon$-nearsorts its incoming valid bits is an $(n, m, 1 - \epsilon/m)$
partial concentrator switch. Using this result, he designed multichip partial
concentrator switches based on two algorithms for sorting on a mesh. The
first, based on Revsort (Schnorr and Shamir), is an $(n, m, 1 - O(n^{3/4}/m))$
partial concentrator with at most $2\sqrt{n} + \lfloor (\lg n)/2 \rfloor$ data pins per chip,
$\Theta(\sqrt{n})$ chips, and volume $\Theta(n^{3/2})$, in which a message incurs $3 \lg n + O(1)$
gate delays. The second switch, based on Columnsort (Leighton), is an
$(n, m, 1 - O(n^{2-2\beta}/m))$ partial concentrator switch with $\Theta(n^{\beta})$ data pins per
chip, $\Theta(n^{1-\beta})$ chips, and volume $\Theta(n^{1+\beta})$, for any $1/2 \le \beta \le 1$. A message
incurs $4\beta \lg n + O(1)$ gate delays in passing through this switch.

Cynthia  Phillips  and  Charles  Leiserson  completed  their  development  of  a  simple 
parallel  algorithm  for  contraction  of  n-node  bounded-degree  planar  graphs 
which  solves  the  problem  of  region  labeling  in  vision  systems  and  leads  to 
algorithms to compute spanning trees and biconnected components of
bounded-degree planar graphs. The connected components algorithm runs in $O(\lg n)$
randomized time on the restrictive exclusive-read exclusive-write PRAM model.
They are currently trying to extend these results to planar graphs of
arbitrary degree. Phillips is trying to develop an $O(\lg n)$-time parallel
algorithm for region labeling in n-voxel 3D images by exploiting the restrictive
topology and geometry of the spatial tessellation. She is also investigating
the  use  of  randomized  search  techniques  for  fast,  simple  connected  components 
algorithms  for  general  graphs. 

Shlomo  Kipnis,  Joe  Kilian,  and  Charles  Leiserson  have  been  investigating  the 
power  of  bussed  interconnection  schemes  for  realizing  permutations  among 
chips. They established a correspondence between bussed permutation
architectures and difference covers for permutation sets. As an example, only $\sqrt{n}$
pins per chip are needed to realize all cyclic shifts among n chips in one
clock cycle. Using point-to-point wires, n - 1 pins per chip are required.
They also show that $O(\sqrt{n})$ pins per chip suffice to realize any abelian group
of permutations, and any general group of permutations requires $O(\sqrt{n} \lg n)$
pins  per  chip.  Some  other  results  include  bussed  interconnection  schemes  for 
hypercubes  (d  +  1  pins  per  chip),  shuffle-exchange  graphs  (3  pins  per  chip), 
and  d-dimensional  meshes  (d  +  1  pins  per  chip). 
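
The combinatorial object behind the cyclic-shift example can be checked directly. The sketch below assumes the usual definition of a difference cover for the cyclic group Z_n: a set D such that every residue mod n is a difference of two elements of D. How such a set is wired into buses is not spelled out here, so the code only verifies the covering property; the function names are illustrative.

```python
def is_difference_cover(d_set, n):
    """True if every residue mod n equals (a - b) mod n for some a, b in d_set."""
    covered = {(a - b) % n for a in d_set for b in d_set}
    return covered == set(range(n))


def witness_for_shift(d_set, n, k):
    """Return a pair (a, b) from d_set with (a - b) mod n == k, or None."""
    for a in d_set:
        for b in d_set:
            if (a - b) % n == k % n:
                return a, b
    return None


if __name__ == "__main__":
    # {0, 1, 3} covers all 7 cyclic shifts with 3 elements, close to sqrt(7).
    print(is_difference_cover({0, 1, 3}, 7))
    print(witness_for_shift({0, 1, 3}, 7, 5))
```

A set of size k yields at most k(k - 1) nonzero differences, so any cover needs on the order of sqrt(n) elements, which is consistent with the pin count quoted above.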

Alex  Ishii  and  Charles  Leiserson  have  continued  work  on  understanding  the 
timing of level-clocked synchronous circuits. For periodic clocking
disciplines, they now have an efficient graph-theoretic algorithm for determining
whether  a  circuit  operates  properly. 

Miller  Maley  is  completing  his  Ph.D.  on  planar  routing.  He  has  provided  a 
solid  topological  framework  for  understanding  routing  problems  in  geometric 
terms,  which  has  led  to  good  algorithms  for  routing,  routability  testing,  and 
compaction  with  automatic  jog  introduction. 

Andrew  Goldberg  has  completed  his  Ph.D.  on  parallel  and  sequential  algorithms 
for  graph  problems.  Among  his  recent  results  is  a  new  algorithm  (joint  with 
R.  E.  Tarjan  of  Princeton)  for  determining  the  minimum-cost  flow  in  a  network. 
The  algorithm  is  based  on  a  new  successive-approximation  technique  and  is 
theoretically  the  best  algorithm  to  date  for  many  instances  of  the  problem. 
Applications of mincost flow in VLSI include pad routing and optimally
terminating interconnections. Its applications in the field of operations
management are more extensive.

Paul Beame, Tom Leighton, and Charles Leiserson observed that a
shuffle-exchange graph or a butterfly network on n nodes can simulate an n-node
hypercube with a slowdown of only $O(\lg \lg n)$, instead of the usual $O(\lg n)$,
if  the  simulation  is  off-line  and  the  hypercube  uses  only  one  dimension  at  a 
time. 

Serge  Plotkin  observed  that  the  firing-squad  problem,  normally  stated  for  a 
linear  array  of  finite  automata,  could  be  solved  on  an  arbitrary  bounded- 
degree  graph  in  time  proportional  to  the  diameter  of  the  graph. 

Andrew  Goldberg  and  Serge  Plotkin  have  been  developing  efficient  parallel 
algorithms  for  symmetry-breaking  in  sparse  graphs  of  processors.  The 
symmetry-breaking  algorithms  give  efficient  ways  to  convert  probabilistic 
algorithms  to  deterministic  algorithms.  Some  of  the  techniques  have  been 
applied  to  construct  several  efficient  linear-processor  algorithms  for  graph 
problems, including an $O(\lg^* n)$-time algorithm for $(\Delta + 1)$-coloring of
constant-degree  graphs. 

Philip  Klein  has  developed,  in  collaboration  with  John  Reif  of  Duke,  an 
efficient  parallel  algorithm  for  planarity.  On  n-node  graphs,  the  algorithm 
works in $O(\log^2 n)$ time using only n processors, in contrast to the
previous best algorithm, which used about $n^3$ processors to achieve the same
time bound. In October 1986, he presented the research at the IEEE Symposium
on Foundations of Computer Science. He is currently working on
representations of directed graphs that permit reachability queries to be answered
efficiently  in  parallel. 

Paul  Beame  has  been  working  on  methods  for  proving  lower  bounds  on  the 
resources  needed  by  parallel  computers  in  order  to  solve  various  computational 
problems.  He  has  extended  the  results  of  his  previous  joint  work  with  Johan 
Hastad  in  this  area  to  include  new  graph-theoretic  problems  including  finding 
small cliques in graphs. He is also investigating the computational
advantages of concurrent-read memory access for parallel computers. Paul has
improved  the  running  time  of  the  best  known  parallel  machine  algorithms  for 
symmetry-breaking  on  bounded-degree  graphs  to  O(log  log*  n).  Also,  with 
Baruch  Awerbuch,  he  has  shown  that  for  distributed  computation  there  are 
several infinite families of graphs for which known symmetry-breaking
algorithms are optimal.
SYSTEMS


Message-Driven Processor

The  natural  grain  size  of  many  parallel  algorithms  is  about  10  instructions. 

To  fully  exploit  the  concurrency  in  such  algorithms,  we  must  be  able  to 
efficiently  execute  tasks  of  this  length.  The  message  transmission  and 
reception  overhead  of  existing  systems  is  in  excess  of  200  instruction  times. 
With such a large overhead, these systems must execute tasks at the
artificially large grain size of about 1,000 instructions. If we can reduce the
overhead  and  operate  at  the  natural  grain  size,  we  can  effectively  apply  100 
times  as  many  processing  elements  to  the  problem. 
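
The arithmetic behind the factor of 100 can be made explicit. The sketch below uses the grain sizes and the 200-instruction overhead quoted above; the efficiency formula grain / (grain + overhead) is an assumed first-order model, not one stated in this report.

```python
def efficiency(grain, overhead):
    """Fraction of time spent on useful work (assumed first-order model)."""
    return grain / (grain + overhead)


def available_tasks(total_instructions, grain):
    """Number of independent tasks a fixed amount of work splits into."""
    return total_instructions // grain


if __name__ == "__main__":
    work = 10**6  # arbitrary total instruction count, for illustration only
    for grain in (10, 1000):
        print(grain, available_tasks(work, grain), round(efficiency(grain, 200), 3))
```

Under this model, 1,000-instruction tasks run at roughly 83% efficiency with a 200-instruction overhead, while 10-instruction tasks would run below 5%. Moving from a grain of 1,000 to a grain of 10 exposes 100 times as many tasks, which is only worthwhile once the overhead is reduced to a few instruction times.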

The  Message-Driven  Processor  (MDP)  is  a  processing  element  for  a  fine-grain, 
message-passing  concurrent  computer.  This  36-bit,  tagged  machine  is  being 
designed  as  a  single-chip  processing  element  incorporating  on-chip  memory  and 
router. We are using the MDP as a vehicle for experimenting with methods of
reducing processor latency on message reception. The current version of the
MDP can respond to an arriving message in less than 0.5 µs, as compared to the
300 µs response time in the Intel iPSC. This low latency is achieved by
providing  hardware  for  queuing  messages  and  for  dispatching  instruction 
sequences  in  response  to  message  arrival. 

The  control  mechanism  of  the  MDP  is  driven  by  the  arriving  message  stream. 

The  MDP  dispatches  control  on  the  basis  of  messages  arriving  over  the  network 
just as a conventional processor dispatches control on the basis of
instructions fetched from memory. When a message is received by an MDP, a decision
is made in hardware either to queue the message (without slowing the executing
task) or to interrupt the current task and execute the message. If the message
is  to  be  executed,  control  is  immediately  dispatched  to  the  appropriate 
instruction  sequence  and  useful  instructions  are  executed  five  clock  cycles 
after  message  arrival. 

To  achieve  a  fast  context  switch,  the  MDP  is  a  memory-based  rather  than 
register-based  machine.  Only  four  user  registers  need  be  saved  on  a  task 
switch,  and  two  register  sets  are  provided  so  that  the  most  common  task 
switches  can  be  performed  with  no  saving.  The  MDP  can  access  its  on-chip 
memory  in  a  single  clock  cycle,  eliminating  the  need  for  a  large  register  set. 
Each  MDP  instruction  may  take  one  operand  from  memory.  A  small  register  set 
is  used  to  provide  the  remaining  two  operands  since  the  cost  of  multi-porting 
the  memory  is  excessive.  Many  of  the  techniques  used  on  uniprocessors  to  keep 
data  on  chip  (load/store  instruction  sets,  register  windows,  stack  caches, 
etc.) do not work well on a multicomputer where context switches happen every
10 instructions as opposed to every 25,000, and where there is little LIFO
allocation of stack frames. With the MDP we have found that a simple memory-based
instruction  set  operating  out  of  a  small  on-chip  memory  supports  the  needs  of 
multicomputer  programs. 

The  MDP  is  also  an  experiment  in  unifying  shared-memory  and  message-passing 
parallel  computers.  Shared-memory  machines  provide  a  uniform  global  name 
space (address space) that allows processing elements to access data
regardless of its location. Message-passing machines perform communication and
synchronization via node-to-node messages. These two concepts are not
mutually exclusive. The MDP provides a virtual addressing mechanism intended
to support a global name space while using an execution mechanism based on
message passing. While our plans are to program the machine using an actor
model of computation, we provide mechanisms to efficiently support models
ranging  from  dataflow  to  communicating  sequential  processes. 

The MDP uses a memory architecture that allows both indexed and
set-associative accesses to its on-chip memory. The set-associative mode is used by the
run-time  system  to  provide  virtual  addressing,  and  to  accelerate  late-binding 
method lookup. By building comparators into the column multiplexer of the
on-chip RAM, we are able to provide set-associative access with only a small
increase in the size of the RAM's peripheral circuitry. The MDP is the only
machine we are aware of that makes the address translation mechanism
explicitly available to the programmer. In writing system code, we have found it
to  be  an  extremely  useful  feature. 
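
The dual access mode can be modeled in software as follows. This is a behavioral sketch only: the row count, the two-way organization, the modulo row selection, and the round-robin replacement are all assumptions made for illustration and do not describe the MDP's actual memory organization.

```python
class AssociativeMemory:
    """Toy memory whose rows can be read by index or searched by key."""

    def __init__(self, num_rows=64, ways=2):
        self.num_rows = num_rows
        self.ways = ways
        # Each row holds `ways` slots; a slot is None or a (key, value) pair.
        self.rows = [[None] * ways for _ in range(num_rows)]
        self._next_victim = [0] * num_rows

    def _row_of(self, key):
        # Assumed row selection: low-order part of the key.
        return key % self.num_rows

    def read_indexed(self, row, way):
        """Indexed access: return the slot at an explicit location."""
        return self.rows[row][way]

    def enter(self, key, value):
        """Install a translation, replacing a slot round-robin if the row is full."""
        row = self._row_of(key)
        slots = self.rows[row]
        for w, slot in enumerate(slots):
            if slot is None or slot[0] == key:
                slots[w] = (key, value)
                return
        w = self._next_victim[row]
        slots[w] = (key, value)
        self._next_victim[row] = (w + 1) % self.ways

    def lookup(self, key):
        """Associative access: compare the key against every slot in its row."""
        for slot in self.rows[self._row_of(key)]:
            if slot is not None and slot[0] == key:
                return slot[1]
        return None
```

In this model the same storage serves both as ordinary indexed memory and as a translation table, so run-time code could use `enter` and `lookup` for virtual addresses or for method lookup, in the spirit of the programmer-visible mechanism described above.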

Over the past six months we have constructed three register-transfer
simulators of the MDP. Each simulator marked a step in the evolution of the MDP to
its  present  form.  A  major  step  was  the  elimination  of  a  large  hard-wired 
message  set  in  favor  of  a  single  message  type  (execute)  and  the  use  of 
instruction  sequences  to  define  more  complex  messages.  Our  simulations  showed 
that  defining  messages  with  instruction  sequences  required  only  one  or  two 
additional  clock  cycles  over  the  hard-wired  approach.  The  added  flexibility 
and  the  simplification  of  the  design  resulting  from  a  single  message  type  more 
than  offset  this  small  performance  penalty.  The  flexibility  is  particularly 
important  in  an  experimental  machine.  For  example,  we  can  redefine  system 
messages  such  as  'SEND'  or  'CALL'  to  add  instrumentation.  The  addressing 
mechanisms  used  to  access  instruction  streams  and  message  arguments  have  also 
evolved  through  simulation  and  by  coding  many  of  the  run-time  system  routines. 
Layout  studies  for  critical  components  of  the  MDP  are  currently  under  way. 


Bidirectional  Torus  Router 

The Bidirectional Torus Router (BTR) is a self-timed multicomputer
communication chip that we are using to experiment with techniques for improving the
performance  of  multicomputer  communication  networks. 

One  idea  we  are  testing  with  the  BTR  is  using  virtual  channels  to  multiplex 
two  logical  communication  networks  on  a  single  physical  network.  The  BTR 
provides  two  classes  of  communication,  user  and  system,  that  have  separate 
buffers  but  share  the  same  physical  wires.  The  two  networks  are  logically 
completely  separate.  Our  system  software  uses  this  separation  to  provide 
priority  service  for  critical  messages.  Also,  if  the  user  network  becomes 
congested  (perhaps  due  to  queue  overflow  in  a  processing  node)  we  transmit 
emergency  messages  on  the  system  network  to  clear  the  problem. 
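
The virtual-channel idea can be sketched as one physical link fed by two separately buffered logical channels. The buffer depth and the strict system-before-user arbitration below are assumptions made for illustration; the BTR's actual arbitration policy is not specified here.

```python
from collections import deque


class SharedLink:
    """One physical link multiplexed between 'system' and 'user' channels."""

    def __init__(self, depth=4):
        self.depth = depth
        # Separate buffers keep the two logical networks independent.
        self.buffers = {"system": deque(), "user": deque()}

    def offer(self, channel, flit):
        """Try to enqueue a flit; return False if that channel's buffer is full."""
        buf = self.buffers[channel]
        if len(buf) >= self.depth:
            return False
        buf.append(flit)
        return True

    def cycle(self):
        """Send one flit over the shared wires, serving 'system' first (assumed)."""
        for channel in ("system", "user"):
            if self.buffers[channel]:
                return channel, self.buffers[channel].popleft()
        return None
```

Because the buffers are separate, a full user buffer does not stop `offer("system", ...)` from succeeding, which is what allows an emergency message to get through a congested user network.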


We  also  plan  to  use  the  BTR  to  evaluate  the  use  of: 
1. A Galois counter to reduce routing delay. The Galois counter can decrement
in  a  single  gate  delay  as  opposed  to  about  8  gate  delays  for  a  previous  chip 
(TRC)  using  an  integer  counter. 

2. A single-wire request/acknowledge line to reduce pin count.

3.  A  distributed  token-passing  arbiter.  This  arbiter  allows  us  to  efficiently 
multiplex  two  directions  of  communication  on  a  single  channel.  As  long  as  one 
chip  holds  the  token,  it  can  use  the  channel  without  paying  a  round-trip  delay 
to  the  other  chip  for  arbitration. 

4. A partitioned data path rather than a crossbar switch to reduce router
area. 
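
The Galois counter of item 1 can be illustrated with a linear feedback shift register in Galois configuration. The 4-bit width and the feedback mask below (corresponding to the polynomial x^4 + x^3 + 1) are assumptions chosen for a small example; the BTR's actual counter width and polynomial are not given here.

```python
def galois_step(state, taps=0b1100):
    """Advance a right-shifting Galois LFSR by one count.

    Each new bit is either a shifted copy of its neighbor or that copy XORed
    with the outgoing bit, so no carry has to ripple across the word.
    """
    lsb = state & 1
    state >>= 1
    if lsb:
        state ^= taps
    return state


def count_sequence(start=1, taps=0b1100):
    """List the states visited from `start` until the counter returns to it."""
    seq = [start]
    s = galois_step(start, taps)
    while s != start:
        seq.append(s)
        s = galois_step(s, taps)
    return seq
```

With these parameters the counter visits all 15 nonzero 4-bit states before repeating, so a hop count can be encoded as a position in this sequence and "decremented" with one shift-and-XOR step instead of a borrow chain.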

During  the  past  six  months  we  have  completed  a  revision  of  the  BTR  logic 
design  and  have  begun  simulations  to  verify  the  design. 


Performance  of  k-ary  n-cube  Interconnection  Networks 

We have studied the behavior of k-ary n-cube interconnection networks under
varying  traffic  using  both  simulation  and  queuing  models.  We  have  developed 
an  analytical  model  of  latency  as  a  function  of  offered  traffic  in  unbuffered 
k-ary  n-cube  interconnection  networks  that  agrees  with  network  simulation 
results  to  within  5%.  Both  simulation  and  analysis  indicate  that  latency 
grows slowly with offered traffic until a saturation point of about 50%
capacity. This result implies that the latency advantage of low-dimensional
k-ary  n-cubes  is  not  degraded  by  traffic  as  long  as  the  network  is  operated 
below  the  saturation  point. 

Abstract: This paper investigates limitations on the frequency response of networks
constructed  out  of  components  specified  by  their  small  signal  models.  Tellegen’s  theorem  is 
used  to  find  the  maximum  frequency  of  oscillation.  A  test  on  the  complex  non-Hermitian 
matrix Y, based on the convexity of the numerical range, is developed to determine if
the quadratic form $0 = v^H Y v$ has a nonunique solution for v. We show that v = 0 is
unique iff $Y e^{j\theta} + (Y e^{j\theta})^H$ is positive definite for some $\theta \in [0, 2\pi)$.
Transistor and negative resistance amplifier examples are developed.
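
The positive-definiteness test stated in the abstract can be tried numerically on small cases. The sketch below is restricted to 2 x 2 matrices, for which positive definiteness of a Hermitian matrix reduces to a positive leading entry and a positive determinant, and it scans the rotation angle on a uniform grid. The grid scan is an assumed numerical shortcut, not the paper's procedure, and it can miss a very narrow range of admissible angles.

```python
import cmath
import math


def rotated_hermitian_part_is_pd(y, theta):
    """Test whether Y e^{j theta} + (Y e^{j theta})^H is positive definite.

    `y` is a 2 x 2 nested list of complex admittance entries.
    """
    r = cmath.exp(1j * theta)
    h11 = 2.0 * (y[0][0] * r).real
    h22 = 2.0 * (y[1][1] * r).real
    h12 = y[0][1] * r + (y[1][0] * r).conjugate()
    det = h11 * h22 - abs(h12) ** 2
    return h11 > 0.0 and det > 0.0


def zero_is_unique_solution(y, steps=3600):
    """True if some grid angle makes the rotated Hermitian part positive definite."""
    return any(
        rotated_hermitian_part_is_pd(y, 2.0 * math.pi * k / steps)
        for k in range(steps)
    )


if __name__ == "__main__":
    passive_like = [[1 + 1j, 0], [0, 1 - 1j]]   # numerical range excludes 0
    indefinite = [[1, 0], [0, -1]]              # numerical range contains 0
    print(zero_is_unique_solution(passive_like))
    print(zero_is_unique_solution(indefinite))
```

When the function returns True, the only v with $v^H Y v = 0$ is v = 0, so by the paper's criterion the tested complex frequency cannot be a natural frequency of any interconnection of the components.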

1  Introduction 

The  frequency  limitations  of  a  circuit  are  determined  by  the  frequency  characteristics  of 
its  components  and  the  cleverness  with  which  those  components  are  interconnected.  It  is 
possible  to  discover  limitations  and  bounds  on  the  frequency  response  of  a  circuit  based 
only  on  the  characteristics  of  the  components.  For  instance,  the  natural  frequencies  (poles 
and  zeros  of  the  admittance)  of  any  one-port  constructed  only  of  linear  passive  resistors 
and capacitors must lie on the negative real axis of the s-plane ($\omega = 0$ and $\sigma < 0$,
where $s = \sigma + j\omega$) [1]. In an example of more relevance today, we can examine a circuit
composed  of  incrementally  active  devices,  such  as  transistors,  and  ask,  what  is  the  range  of 
natural  frequencies  achievable  by  linear  time-invariant  autonomous  circuits  built  of  these 
components? 

Figure 1a illustrates a simple model for a MOS transistor together, in Fig. 1b, with
the permitted natural frequencies of oscillation of networks constructed solely of these
devices (see footnote 2). None of these networks, regardless of how cleverly constructed, can have purely
sinusoidal natural frequencies of oscillation above $\omega_{max}$. This example was adapted from
the work of Thornton [2, 3]. In Fig. 1b, R = 0 and the permitted and forbidden regions
are separated by the line

$$\sigma = \frac{g_m^2}{4 g_d C_{gs}} - \frac{\omega^2 g_d C_{gs}}{g_m^2}$$


1  This  research  was  supported  by  the  Defense  Advanced  Research  Projects  Agency 
under  contract  number  N00014-80-C-0622. 

2  For  left-half-plane  frequencies  the  natural  frequencies  of  “oscillation”  have  voltage 
solutions  that  decay  exponentially  with  time. 
The reason poles can lie on the $j\omega$ axis without inductors is that the transistors can be used
to make gyrators, which enable capacitors to emulate inductors. Note how the permitted
natural frequencies of oscillation in Fig. 1b reduce to those of an RC network as $g_m$ goes
to zero.

The natural frequencies of oscillation of a transistor are closely related to the
frequencies at which that transistor can be active. Mason was the first to develop a figure of merit
for a transistor in a lossless reciprocal embedding, with his work on the unilateral gain of
a linear two-port [4]. Later, Thornton published his work on the allowed natural
frequencies of active RC networks [5], relating it back to Mason's. Kuh and Desoer generalized
both works by expanding the class of networks considered and allowing nonreciprocal
embeddings [6,7]. In this paper, the class of networks considered will be further expanded,
systematized, and extended in a way suggestive of a simple computation to test a
complex frequency to determine whether or not it can be a natural frequency of a network
composed  of  any  interconnection  of  components  specified  by  their  small  signal  models.  A 
general  statement  of  the  problem  is: 

Given a set of linear components, taken in any multiplicity and
impedance scaled by any positive real number, what is the lowest
sinusoidal frequency, or more generally the largest region in the s-plane,
for which one can prove that connected networks built of these
components cannot sustain oscillation?

The first step is to reformulate the problem in terms of Tellegen’s theorem [8] to achieve
increased generality. This makes it possible to express the problem in terms of any set of
components $\mathcal{S}$, characterized by their small signal models and interconnected by wire. This
generalization is important because the performance of most modern devices is dominated
by a complex network of parasitics. To neglect these parasitics is to be overly optimistic,
such as in the case of Fig. 1 where the use of inductors can increase $\omega_{\max}$ to $\infty$. The
reason $\omega_{\max}$ can be increased without bound in this simple model is that all of the internal
capacitors are directly connected across external terminals and hence can be resonated
out, at any desired frequency, with a parallel inductor. This clearly nonphysical prediction
occurs because the model used was excessively simple. Note that $\sigma_{\max}$ does not increase
because both inductors and capacitors absorb power for all time when excited with a
growing exponential voltage or current waveform (in this problem, $\omega = 0$ at $\sigma_{\max}$).

The Tellegen formulation leads to a complex nonHermitian matrix $Y(s)$ which captures
the relevant frequency domain information. In this matrix, each type of component
need be represented only once. A simple test on $Y(s)$ will then be presented which can
discover whether or not nonzero voltage and current solutions are permitted at a frequency $s$.
From this we can discover regions of the s-plane where, independent of the circuit topology,
the network cannot oscillate. Since the synthesis problem is not addressed, one cannot use
this theory to predict that all regions of the s-plane in which oscillation is permitted are
achievable with a realizable circuit. Thus these results have a fundamental-limits flavor in
which one can say definitively what is forbidden but not what is necessarily realizable.

We have motivated this investigation with the goal of mapping out the s-plane into
forbidden and permitted regions. In the next three sections we will focus the discussion
on the subproblem of examining a specific point in the plane, $s_0$, and discovering whether
or not it is in the forbidden region. In Section 5 (Examples) we again look at the s-plane
as a whole but, instead of mapping it completely, we will parameterize it in terms of the
two points where the line separating the forbidden and permitted regions intersects the real
and imaginary axes: $\sigma_{\max}$ and $\omega_{\max}$.

2  Tellegen  Formulation  and  the  Conservation  of  Complex  Power 

One of the many special cases of Tellegen’s theorem states that the inner product of branch
voltages $v$ and branch currents $i$ is zero. Here $v$ is the column vector of Laplace transforms of
branch voltages and $i$ is the column vector of Laplace transforms of branch currents, with
associated reference directions imposed. The branches are numbered. We have

$$ 0 = v^H(s)\, i(s), \tag{2} $$

where $x^H$ is defined as the complex conjugate transpose of $x$. Equation (2) can be
interpreted as the conservation of complex power. The fact that two real quantities are
conserved in (2), the physically intuitive conservation of real power and the more
enigmatic conservation of imaginary power, makes this problem somewhat nonstandard.
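
As a quick numerical illustration of (2), the sketch below checks that $v^H i$ vanishes for any branch voltages obeying Kirchhoff's voltage law and any branch currents obeying Kirchhoff's current law on the same graph. The four-node, six-branch graph, the random seed, and the helper name `tellegen_inner_product` are assumptions made for this illustration; they do not come from the paper.

```python
import numpy as np

def tellegen_inner_product(A, seed=0):
    """Return v^H i for random KVL-consistent v and KCL-consistent i.

    A is the reduced node-branch incidence matrix (datum node row removed).
    """
    rng = np.random.default_rng(seed)
    n_nodes, n_branches = A.shape
    # KVL: branch voltages are differences of node potentials, v = A^T e.
    e = rng.normal(size=n_nodes) + 1j * rng.normal(size=n_nodes)
    v = A.T @ e
    # KCL: A i = 0, so project a random vector onto the null space of A.
    z = rng.normal(size=n_branches) + 1j * rng.normal(size=n_branches)
    i = z - np.linalg.pinv(A) @ (A @ z)
    return np.vdot(v, i)  # vdot conjugates its first argument: v^H i

# Hypothetical complete graph on 4 nodes (node 3 is the datum), 6 branches:
# 0->1, 0->2, 0->3, 1->2, 1->3, 2->3.
A = np.array([
    [ 1,  1,  1,  0,  0,  0],
    [-1,  0,  0,  1,  1,  0],
    [ 0, -1,  0, -1,  0,  1],
], dtype=float)

print(abs(tellegen_inner_product(A)))  # zero up to rounding error
```

Because the voltages and currents are drawn independently, the check also illustrates that the result depends only on the topology, not on any constitutive relation between $v$ and $i$.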

Tellegen’s theorem for a network can also be expressed in terms of the port or terminal
voltages and currents of its subnetworks [9]. Let $\mathcal{S} = \{M_1, \ldots, M_M\}$ be any set of linear
multiports, not necessarily all of a size, characterized by an associated set of admittance
matrices $\mathcal{Y} = \{Y_1(s), \ldots, Y_M(s)\}$. For each component $k$ we have

$$ i_k = Y_k v_k. \tag{3} $$

Let $\mathcal{M}$ be any network obtained by producing a connected network from the elements
of $\mathcal{S}$ and ideal wire, without violating the port assumptions of the admittance matrices.
Tellegen’s theorem for $\mathcal{M}$ states

$$ 0 = \sum_{k=1}^{M} v_k^H i_k \tag{4} $$

or

$$ 0 = \sum_{k}^{M} v_k^H Y_k v_k. \tag{5} $$

It is helpful to rewrite (5) in block diagonal form

$$ 0 = x^H Y x, \tag{6} $$

where

$$ x^T = (v_1^T, \ldots, v_M^T) \tag{7} $$

$$ Y(s) \triangleq \begin{pmatrix} Y_1 & 0 & 0 \\ 0 & \ddots & 0 \\ 0 & 0 & Y_M \end{pmatrix}. \tag{8} $$

Equation (6) is a quadratic form but note that $Y$ is typically nonHermitian. A physical
interpretation of $Y$ is shown in Fig. 2. The network it represents can be thought of as a
compound component containing one of each different subcomponent. By wiring its ports
one obtains a network $\mathcal{M}$.

In deference to the leverage of integrated circuit technology, it is highly desirable
to generalize the set of networks considered by the theory to networks which include,
not just one, but any number of instantiations of the components from the set $\mathcal{S}$. Let
$\{\mathcal{C}_1, \ldots, \mathcal{C}_N\}$ be any set of multiports characterized by admittance matrices that
are non-negative scalar multiples of matrices from the set $\mathcal{Y}$. Let $\mathcal{N}$ be any connected
network obtained by interconnecting $\mathcal{C}_1, \ldots, \mathcal{C}_N$ using only ideal wire and ideal
transformers. Let $a_1, \ldots, a_N \in \mathbb{R}$ represent the non-negative admittance scaling factors
on the admittance matrices from $\mathcal{Y}$. Tellegen’s theorem for $\mathcal{N}$ can be written

$$ 0 = \sum_{i}^{N} a_i\, x_i^H(s)\, Y(s)\, x_i(s). \tag{9} $$

Equation  (9)  contains  an  explicit  summation  over  compound  component  instantiations 
and  an  implicit  sum  of  subcomponent  types.  The  physical  interpretation  of  (9)  is  one  of 
multiple,  admittance  scaled,  instantiations  of  the  network  in  Fig.  2,  wired  together  in  an 
arbitrary  manner.  Subcomponents  which  are  not  needed  have  their  ports  shorted.  With 
a  change  of  variables,  (9)  may  be  rewritten 

$$ 0 = \sum_{i} w_i^H(s)\, Y(s)\, w_i(s), \tag{10} $$

where

$$ w_i(s) = \sqrt{a_i}\, x_i(s). \tag{11} $$


3  Main  Result 

Limitations on the frequency behavior of $\mathcal{N}$ can be discovered by investigating the
frequencies $s$ at which (9) has nontrivial voltage solutions. No network $\mathcal{N}$ can be constructed to
have a natural frequency $s_0 \in \mathbb{C}$ unless there is a nonzero voltage solution to (9). Using
this criterion we can divide the s-plane into forbidden and permitted regions; see Fig. 1b.

It is desirable to have a simple test on the set of admittance matrices $\mathcal{Y}$ which tells
whether $s_0$ is in the permitted or forbidden region. If $Y$ were Hermitian (or even
antiHermitian) then testing to see if (6), a special case of (9), has a nonzero solution $v$ would
be straightforward. If $Y$ is either positive- or negative-definite then no nontrivial
solutions exist. Standard tests for definiteness include looking at the eigenvalues, pivots, or
subdeterminants of $Y$ [10].

We can examine necessary conditions for (6) to have a solution by looking separately
at the Hermitian and antiHermitian parts of (6). Let

$$ Y_H \triangleq \tfrac{1}{2}\left(Y + Y^H\right) \tag{12} $$

and

$$ Y_{AH} \triangleq \tfrac{1}{2}\left(Y - Y^H\right), \tag{13} $$

where $Y_H$ is the Hermitian part of $Y$ and $Y_{AH}$ is the antiHermitian part. Testing these
parts separately, however, generates an overly optimistic prediction of the size of the
permitted region because, for instance, when looking at $Y_H$ we are allowing $v^H Y_{AH} v$ to
take on any (imaginary) value. Said another way, if $\mathcal{V}_H$ is the solution set to $0 = v_H^H Y_H v_H$,
excluding $v_H = 0$, and $\mathcal{V}_{AH}$ the solution set to $0 = v_{AH}^H Y_{AH} v_{AH}$, excluding $v_{AH} = 0$, the
solution set to (6) is the intersection of $\mathcal{V}_H$ and $\mathcal{V}_{AH}$, which can easily be empty even
when $\mathcal{V}_H$ and $\mathcal{V}_{AH}$ are nonempty. In (6) we are looking for simultaneous solutions to two
quadratic forms.

While a definiteness test on the Hermitian part of $Y$ is not quite what we want, it is
close. The following theorem is the basis for a simple method of computing the boundary
separating the forbidden and permitted regions.

Theorem 1 (Main result)

Let $\mathcal{Y} = \{Y_1(s), \ldots, Y_M(s)\}$ be any finite set of admittance matrices (not necessarily of
the same size). Let $\mathcal{C}_1, \ldots, \mathcal{C}_N$ be any set of linear multiports characterized by admittance
matrices that are non-negative scalar multiples of matrices from the set $\mathcal{Y}$; i.e., for each
$i \in \{1, \ldots, N\}$ the admittance matrix of $\mathcal{C}_i$ is of the form $a_i Y_k(s)$ for some $k \in \{1, \ldots, M\}$,
where $a_i \in \mathbb{R}$ is non-negative. Define $Y(s) \triangleq \mathrm{diag}\{Y_1(s), \ldots, Y_M(s)\}$. Let $\mathcal{M}$ be any
connected network obtained by interconnecting $\mathcal{C}_1, \ldots, \mathcal{C}_N$ using only ideal (multiwinding)
transformers and ideal connecting wire, and $s_0 \in \mathbb{C}$ be any complex frequency not a pole
or zero of $Y(s)$. If

$$ A(\theta, s_0) \triangleq \tfrac{1}{2}\left(Y(s_0)\, e^{j\theta} + Y^H(s_0)\, e^{-j\theta}\right) \tag{14} $$

is positive definite for some $\theta \in [0, 2\pi)$, then $s_0$ is not a natural frequency of $\mathcal{M}$. If $A(\theta, s_0)$
is not positive definite for any $\theta \in [0, 2\pi)$ then there exists a nonzero voltage solution
$x_1(s_0), \ldots, x_N(s_0)$ to the Tellegen statement of conservation of complex power

$$ 0 = \sum_{i}^{N} a_i\, x_i^H(s_0)\, Y(s_0)\, x_i(s_0). \tag{9} $$

Note that the relatively straightforward test described above suffices to demonstrate
that $s_0$ is not a natural frequency of a remarkably large class of networks: the class of all
networks made of interconnections of any number of multiports described by admittance
matrices in the set $\mathcal{Y}$, or positive scalar multiples of admittance matrices in $\mathcal{Y}$. There is
no requirement that $N < M$, i.e., the number of elements in the network is unlimited.

A restatement of Theorem 1 is more suggestive of an algorithm: If $Y_k(s_0)\, e^{j\theta} +
Y_k^H(s_0)\, e^{-j\theta}$, (twice) the Hermitian part of each phase-shifted subcomponent admittance matrix,
is positive definite for all $Y_k(s_0) \in \mathcal{Y}$ at the same $\theta$ then no network made from the
elements of $\mathcal{Y}$, taken in any multiplicity and admittance scaled by any non-negative real
number, can have a natural frequency $s_0 \in \mathbb{C}$.
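
The restatement above can be sketched as a short routine, given here only as an illustration and not as the authors' implementation. The function names, the uniform grid over $\theta$, and the one-port example values are assumptions of this sketch. A grid search can miss a rotation whose margin is narrower than the grid spacing, so a `False` result is indicative rather than a proof that $s_0$ is permitted.

```python
import numpy as np

def rotated_hermitian_part(Y, theta):
    """A(theta) = (Y e^{j theta} + Y^H e^{-j theta}) / 2, as in (14)."""
    Yr = Y * np.exp(1j * theta)
    return 0.5 * (Yr + Yr.conj().T)

def is_forbidden(Ys, n_theta=3600):
    """True if one rotation theta makes every A_k(theta) positive definite.

    Ys is a list of square arrays: the matrices Y_k(s0), already
    evaluated at the complex frequency s0 under test.
    """
    for theta in np.linspace(0.0, 2.0 * np.pi, n_theta, endpoint=False):
        # eigvalsh returns the real eigenvalues of a Hermitian matrix
        # in ascending order, so index 0 is the smallest one.
        if all(np.linalg.eigvalsh(rotated_hermitian_part(Y, theta))[0] > 0.0
               for Y in Ys):
            return True
    return False

# Hypothetical one-port components: a unit conductance and a unit capacitor.
G, C = 1.0, 1.0
for s0 in (1j, -1.0 + 0j):
    Ys = [np.array([[G]]), np.array([[s0 * C]])]
    print(s0, is_forbidden(Ys))
```

For this RC pair the point $s_0 = j$ tests as forbidden and the point $s_0 = -1$ does not, in line with the earlier statement that RC one-ports have their natural frequencies on the negative real axis.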
4  Proof of Main Result

The proof of Theorem 1 rests on Proposition 1 below, which is a special case of Theorem 1.

Proposition 1

Let $\mathcal{C}_1, \ldots, \mathcal{C}_M$ be linear multiports characterized by matrices $Y_1(s), \ldots, Y_M(s)$,
respectively. Let $\mathcal{M}$ be any network created by interconnecting $\mathcal{C}_1, \ldots, \mathcal{C}_M$ using only ideal
(multiwinding) transformers and ideal connecting wire. Define

$$ Y(s) \triangleq \mathrm{diag}\{Y_1(s), \ldots, Y_M(s)\} \tag{8} $$

and

$$ A(\theta, s_0) \triangleq \tfrac{1}{2}\left(Y(s_0)\, e^{j\theta} + Y^H(s_0)\, e^{-j\theta}\right) \tag{14} $$

as before. For a complex frequency $s_0$, not a pole or zero of $Y(s)$, $s_0$ is not a natural
frequency of $\mathcal{M}$ if $A(\theta, s_0)$ is positive definite for some $\theta \in [0, 2\pi)$.

Note that the assumptions of Proposition 1 are more restrictive than those of
Theorem 1 in that the network elements of $\mathcal{M}$ are precisely those with admittance matrices
$Y_1(s), \ldots, Y_M(s)$. $\mathcal{M}$ contains exactly one element with admittance matrix $Y_1(s)$, one
element with admittance matrix $Y_2(s)$, etc. The proof of Proposition 1 rests on Lemma 1
below.

Lemma 1

Let $B \in \mathbb{C}^{n \times n}$ be any complex matrix. The solution $v = 0$ of

$$ 0 = v^H B v \tag{15} $$

is unique iff

$$ A(\theta) \triangleq \tfrac{1}{2}\left(B e^{j\theta} + B^H e^{-j\theta}\right) \tag{16} $$

is positive definite for some $\theta \in [0, 2\pi)$.

The proof of Lemma 1 rests on the definitions and facts below.

Definition  1 

The numerical range of a complex matrix $B \in \mathbb{C}^{n \times n}$ is the set $W(B)$ of all complex
numbers of the form $x^H B x$, where $x$ varies over all vectors on the unit sphere, $x^H x = 1$
[11-14].

Figure 3a illustrates a possible numerical range for $Y$. This case is an example of
a situation in which the Hermitian and antiHermitian parts of $Y$ test as indefinite yet
$0 \notin W(Y)$. In the figure, $\lambda_{\min}^{H}$ and $\lambda_{\max}^{H}$ represent the smallest and largest eigenvalues
of $Y_H$ and $\lambda_{\min}^{AH}$ and $\lambda_{\max}^{AH}$ represent the smallest and largest eigenvalues of $Y_{AH}$,
respectively. Fig. 3b illustrates a rotation of $Y$ which causes $A(\theta)$ to test positive definite.
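
The situation sketched in Fig. 3 can be reproduced with a small numerical example. The diagonal matrix below is a hypothetical choice made for this illustration (it is not one of the paper's circuits), and the rotation angle was picked by inspection. Since the eigenvalues of $Y_{AH}$ are purely imaginary, the code reports the real eigenvalues of the Hermitian matrix $-jY_{AH}$ instead.

```python
import numpy as np

# Hypothetical normal matrix whose numerical range is the straight
# segment joining its eigenvalues -1+3j and 3-1j.
Y = np.diag([-1.0 + 3.0j, 3.0 - 1.0j])

Y_H = 0.5 * (Y + Y.conj().T)    # Hermitian part
Y_AH = 0.5 * (Y - Y.conj().T)   # antiHermitian part

eig_H = np.linalg.eigvalsh(Y_H)          # eigenvalues of Y_H
eig_AH = np.linalg.eigvalsh(-1j * Y_AH)  # -j*Y_AH is Hermitian

theta = -np.pi / 4.0                     # rotation chosen by inspection
Yr = Y * np.exp(1j * theta)
eig_rot = np.linalg.eigvalsh(0.5 * (Yr + Yr.conj().T))

print(eig_H)    # mixed signs: Y_H is indefinite
print(eig_AH)   # mixed signs: Y_AH is indefinite
print(eig_rot)  # both positive: A(theta) is positive definite
```

Both separate tests are inconclusive here, yet a single rotation moves the whole segment into the right half plane, so zero is not in the numerical range.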

Definition 2

A subset $Q \subseteq \mathbb{C}$ is said to be convex if, for $\lambda \in \mathbb{R}$, $(1 - \lambda)x + \lambda y \in Q$ whenever $x \in Q$,
$y \in Q$, and $0 < \lambda < 1$.

Fact 1

The Toeplitz-Hausdorff theorem states the remarkable fact that the numerical range of a
matrix is always convex [15].
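
Convexity is what allows the numerical range to be described by its boundary alone. One common way to trace that boundary, sketched here as an illustration and not taken from the paper, is to rotate the matrix, take the eigenvector belonging to the largest eigenvalue of the rotated Hermitian part, and evaluate $x^H B x$. The function name and the test matrix are assumptions of this sketch.

```python
import numpy as np

def numerical_range_boundary(B, n_theta=360):
    """Sample boundary points of W(B), one per direction theta."""
    pts = []
    for theta in np.linspace(0.0, 2.0 * np.pi, n_theta, endpoint=False):
        Br = B * np.exp(-1j * theta)
        H = 0.5 * (Br + Br.conj().T)
        # eigh returns orthonormal eigenvectors with ascending eigenvalues.
        # The last column maximizes x^H H x over the unit sphere, so
        # x^H B x is a point of W(B) farthest out in direction theta.
        _, vecs = np.linalg.eigh(H)
        x = vecs[:, -1]
        pts.append(np.vdot(x, B @ x))
    return np.array(pts)

# The numerical range of this nilpotent matrix is the closed unit disk,
# so every sampled boundary point should have modulus one.
B = np.array([[0.0, 2.0], [0.0, 0.0]])
pts = numerical_range_boundary(B)
print(np.abs(pts).min(), np.abs(pts).max())
```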

Lemma  1  rests  on  Lemmas  2  and  3  below. 

Lemma 2

Let $B \in \mathbb{C}^{n \times n}$ be any complex matrix. The solution $v = 0$ of $0 = v^H B v$ is unique iff
$0 \notin W(B)$.

Proof of Lemma 2

Let $\|v\|^2$ denote the $L_2$ inner product $v^H v$ and let $S$ denote the unit sphere in $\mathbb{C}^n$, i.e.,
$S = \{v \in \mathbb{C}^n \mid \|v\| = 1\}$. Clearly (15) has a nonzero solution $v$ iff (15) has a solution
$\hat{v} = v/\|v\|$ lying in $S$. Since $v = 0$ is always a solution to (15), it is unique iff $0 \notin W(B)$.

Lemma 3

A closed convex set $K$ in the complex plane does not contain the origin iff there exists a
rotation about the origin that carries all of $K$ into the open right-half-plane.

Discussion of Lemma 3

This  Lemma  is  obvious  but  the  interested  reader  could  construct  a  proof  by  noting  that 
two  closed  convex  sets,  in  our  case  K  and  the  origin,  containing  no  points  in  common 
must  lie  on  opposite  sides  of  a  separating  hyperplane  (in  this  case  a  straight  line)  [16].  To 
show  that  a  rotation  about  the  origin  carries  all  of  K  into  the  open  right-half-plane  it  is 
sufficient  to  show  that  there  exists  a  rotation  which  brings  the  line  into  the  open  right  half 
plane,  parallel  to  the  imaginary  axis,  with  the  origin  (which  hasn’t  moved)  on  one  side  and 
K  on  the  other. 

Proof of Lemma 1

By Lemma 2, (15) has a nonzero solution $v \in \mathbb{C}^n$ iff $0 \in W(B)$. Equivalently, $0 \notin W(B)$
iff $W(Be^{j\theta})$ is a subset of the open right-half plane for some $\theta \in [0, 2\pi)$, by Lemma 3. Let
$D \triangleq Be^{j\theta}$ and $A \triangleq (D + D^H)/2$ be the Hermitian part of $D$. For any matrix $D \in \mathbb{C}^{n \times n}$,
$W(A)$ is the projection onto the real axis of $W(D)$. The trivial solution to (15) is unique
iff $W(A(\theta))$ is contained in the strictly positive reals, i.e., iff $A(\theta)$ is positive definite for
some $\theta \in [0, 2\pi)$. ∎

Proof of Proposition 1

Proposition 1 follows directly from Lemma 1 and Tellegen’s Theorem, where we have
identified the block diagonal matrix $Y(s)$ of (6) with the matrix $B$ of Lemma 1. ∎

The  generalization  of  Proposition  1  to  Theorem  1  requires  the  fact  below. 

Fact  2 

Define $B \in \mathbb{C}^{n \times n}$ as the block diagonal matrix $B = \mathrm{diag}(B_1, \ldots, B_N)$ where $\{B_1, \ldots, B_N\}$
is a set of complex square matrices. $W(B)$ is the convex hull of the set of numerical ranges
$\{W(B_1), \ldots, W(B_N)\}$ [15].

Note that $W(B)$ represents the numerical range of the sum of quadratic forms
$\sum_i x_i^H B_i x_i$ where $\sum_i \|x_i\|^2 = 1$. Clearly, for any complex matrix, $W(\mathrm{diag}(B, \ldots, B)) = W(B)$.

Proof of Theorem 1

From Tellegen’s Theorem, no network $\mathcal{N}$ can have a natural frequency $s_0$ if (9) has a unique
solution $x_1, \ldots, x_N = 0$. With a change of variable, (9) has a unique solution at the origin
iff (10) has a unique solution at the origin (unless all of the admittance scaling factors $a_i$
are zero, in which case there is no network). From Lemma 1, (10) has a nonzero solution
iff $0 \in W(\mathrm{diag}(Y, \ldots, Y))$, but from Fact 2, $W(\mathrm{diag}(Y, \ldots, Y)) = W(Y)$. The remainder
of the proof follows in the same manner as the proof of Proposition 1. ∎

Further insight into this theorem may be gained by looking at the Hermitian parts of
the rotated admittance matrices in $\mathcal{Y}$. Let $\mathcal{A} = \{A_1, \ldots, A_M\}$ be a finite set of Hermitian
matrices in one-to-one correspondence with the admittance matrices in $\mathcal{Y}$. Let

$$ A_k(\theta, s_0) \triangleq \tfrac{1}{2}\left(Y_k e^{j\theta} + Y_k^H e^{-j\theta}\right) \tag{17} $$

for all $Y_k \in \mathcal{Y}$. Let $\theta \in [0, 2\pi)$ be a rotation for which $A(\theta, s_0)$ is positive definite. By
Lemmas 1 and 2, $W(Ye^{j\theta})$ lies entirely in the right half plane and by Fact 2 each $W(Y_k e^{j\theta})$
also lies in the right half plane for all $Y_k \in \mathcal{Y}$. Therefore each $A_k(\theta, s_0)$ is positive definite
for all $A_k \in \mathcal{A}$. Thus all subcomponents used in $\mathcal{M}$ (those in which $a_i > 0$) result in a
strictly positive real contribution to the sum $\sum_i a_i x_i^H A x_i$ unless their port voltages are
zero. The only way for this sum to equal zero, as it must by Tellegen’s Theorem, is if all
port voltages are zero.

It is worth observing that the first half of the Theorem (the part which details a
sufficient test to prove that $s_0$ is not a natural frequency of $\mathcal{N}$) can trivially be derived
from (6) by multiplying each side of the equality by $\exp(j\theta)$ and taking the real part. It is
the second half of the theorem (the part that states that if a $\theta$ cannot be found to make
$A(\theta, s_0)$ positive definite then a nonzero voltage solution for (9) exists) that requires the
convexity of the numerical range. In other words, this paper is concerned with investigating
limits no less strict than those which arise from the conservation of real and imaginary
power.

When solutions to (6) are found at an $s$ where $Y$ is singular, further inspection of
the result is required. When $Y$ is singular, voltage solutions $v$ can be sustained with zero
current. These solutions, in which no real or reactive power flows through the network, are
generally of little interest from the standpoint of this work. For the same reason, when
formulating $Y$, the definite form of the admittance matrix must be used.

5  Examples 

In  this  section  we  will  go  through  three  examples  to  illustrate  the  scope  and  utility  of 
the  theory.  The  first  example  is  a  negative-resistance  reflection  amplifier  constructed  from 
the  three  elements  illustrated  in  Fig.  4.  Figure  4a  shows  a  negative  resistance  device  with 
capacitive  and  resistive  parasitics.  This  problem  is  best  formulated  in  terms  of  impedances 
rather  than  admittances.  The  impedance  of  the  amplifier  is 

$$ Z_{amp}(s) = R_s + \frac{1}{sC - G}. \tag{18} $$

Fig. 4b illustrates the inductors available to tune out the reactance of $Z_{amp}$. The inductors
have a parasitic series resistance $R_L$, where $\omega_L = R_L/L$ is a constant of the technology.
We have

$$ Z_L(s) = R_L + sL = (\omega_L + s)L. \tag{19} $$

The third element type we have in the circuit, also illustrated in Fig. 4b, is the passive
resistor: $Z_R(s) = R_R$. The $Z$ matrix for $\mathcal{S}$ is

$$ Z = \begin{pmatrix} Z_{amp} & 0 & 0 \\ 0 & Z_L & 0 \\ 0 & 0 & Z_R \end{pmatrix}. \tag{20} $$

Since $Z$ is diagonal, the eigenvalues appear on the diagonal and the eigenvectors are
orthogonal. Solving for $W(Z)$,

$$ W(Z) = Z_{amp}\|x_1\|^2 + Z_L\|x_2\|^2 + Z_R\|x_3\|^2, \tag{21} $$

where

$$ \|x_1\|^2 + \|x_2\|^2 + \|x_3\|^2 = \|x\|^2 = 1. \tag{22} $$
$x_1$, $x_2$, and $x_3$ are the eigenvectors of $Z$. Figure 5a illustrates the numerical range of $Z$.
The dashed lines indicate how $Z_{amp}$ and $Z_L$ move with increasing $\omega$ ($\sigma = 0$). After some
algebra, one finds

$$ \omega_{\max} = \frac{1}{C}\sqrt{\frac{G - \omega_L C}{R_s} - G^2}. \tag{23} $$


This solution is found at $a_{11}(\theta) = a_{22}(\theta) = 0$. In general, $\det A(\theta)$ can be represented
as a Fourier series with frequency components from $-N$ to $N$ where $N$ is the rank of $A$.
Computational advantage can be gained by also noting that

$$ \det A = \prod_{k=1}^{M} \det A_k. \tag{24} $$

With $R_L = 0$, (23) reduces to the more optimistic expression for $\omega_{\max}$ one would obtain by
simply examining the Hermitian part of $Z$. With the element values of Table 1, $\omega_{\max} = 2$
is predicted if complex power is conserved, but $\omega_{\max} = 3$ is predicted if only real power is
conserved (i.e., looking at only the Hermitian part of $Z$ or only $\theta = 0$).


element       realistic inductor    ideal inductor    units

R_s           0.1                   0.1               Ω
G             1                     1                 Ω^-1
C             1                     1                 F
R_L           0.5                   0                 Ω
L             1                     1                 H
R_R           1                     1                 Ω
ω_L           0.5                   0                 s^-1
ω_max         2                     3                 s^-1

Table 1
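
The $\omega_{\max}$ row of Table 1 can be checked numerically. The sketch below assumes the amplifier impedance has the form $Z_{amp}(s) = R_s + 1/(sC - G)$, that is, a negative conductance $-G$ shunted by $C$ in series with $R_s$, which is how the description of Fig. 4a is read here. Because $Z$ is diagonal, $A(\theta)$ is positive definite exactly when the three rotated impedances all have positive real part, so the test reduces to asking whether the three complex numbers fit in one open half plane. The function names and the bisection bracket are assumptions of this illustration.

```python
import numpy as np

def in_open_half_plane(points):
    """True if some rotation about the origin puts every point in Re > 0."""
    ang = np.sort(np.angle(np.asarray(points, dtype=complex)))
    # Cyclic angular gaps between neighbouring points; a gap larger than
    # pi means the points fit strictly inside one half plane.
    gaps = np.diff(np.append(ang, ang[0] + 2.0 * np.pi))
    return bool(gaps.max() > np.pi)

def omega_max(Rs, G, C, RL, L, RR, lo=1e-3, hi=1e3, iters=200):
    """Bisect for the frequency above which s = j*omega is forbidden."""
    def forbidden(w):
        s = 1j * w
        z_amp = Rs + 1.0 / (s * C - G)   # assumed form of Z_amp(s)
        z_l = RL + s * L                 # inductor with series loss, (19)
        z_r = RR                         # passive resistor
        return in_open_half_plane([z_amp, z_l, z_r])
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if forbidden(mid):
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

print(omega_max(0.1, 1.0, 1.0, 0.5, 1.0, 1.0))  # realistic inductor column
print(omega_max(0.1, 1.0, 1.0, 0.0, 1.0, 1.0))  # ideal inductor column
```

Under these assumptions the two calls should land on the values listed for $\omega_{\max}$ in the realistic and ideal inductor columns of Table 1.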


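The numerical range of $A(\theta)$ is real; it is illustrated in Fig. 5b for $\omega = 2.5$. It lies between
the minimum and maximum eigenvalues;

$$ \lambda_{\min}(A) = \inf \mathrm{Re} \left\{ Z_{amp} e^{j\theta},\; Z_L e^{j\theta},\; Z_R e^{j\theta} \right\} \tag{25} $$

and

$$ \lambda_{\max}(A) = \sup \mathrm{Re} \left\{ Z_{amp} e^{j\theta},\; Z_L e^{j\theta},\; Z_R e^{j\theta} \right\}. \tag{26} $$

Note that $\lambda_{\min}(A(\theta))$ can contain local maxima and points at which the derivative with
respect to $\theta$ is undefined.

The second example we consider is interconnections of the three-terminal MOS
transistor illustrated in Fig. 1a. For this circuit,

$$ Y = \begin{pmatrix} \dfrac{sC}{sRC+1} & 0 \\[2ex] \dfrac{g_m}{sRC+1} & g_d \end{pmatrix}. \tag{27} $$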

Note  that  the  use  of  the  source  node  as  the  datum  does  not  imply  any  restriction  to 
common  source  configurations.  A  datum  must  be  chosen  so  that  a  definite  admittance 
matrix  can  be  developed.  An  indefinite  admittance  matrix  would  always  admit  nonzero  v 
solutions  since  an  arbitrary  constant  could  be  added  to  all  node  voltages.  Note  that  the 
numerical  range  of  any  2x2  matrix  is  a  closed  elliptical  disk  with  foci  at  the  eigenvalues. 

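To aid our pursuit of closed-form solutions, we separate $Y$ into Hermitian and
antiHermitian parts. Rewriting $A(\theta)$,

$$ A(\theta) = \cos\theta \left(Y_H + j Y_{AH} \tan\theta\right). \tag{28} $$

For our purposes, the $\cos\theta$ scaling factor on $A(\theta)$ is inconsequential and may be neglected.
Solutions for $A$ at $\cos\theta = 0$ are generally not of interest because this would imply that
the conservation of real power was unimportant. We define

$$ A'(\delta) \triangleq Y_H + j\delta\, Y_{AH}, \tag{29} $$

where $\delta = \tan\theta$. For $s = \sigma$ we obtain

$$ A'(\delta) = \frac{1}{2} \begin{pmatrix} \dfrac{2\sigma C}{\sigma RC+1} & \dfrac{(1 - j\delta)\, g_m}{\sigma RC+1} \\[2ex] \dfrac{(1 + j\delta)\, g_m}{\sigma RC+1} & 2 g_d \end{pmatrix} \tag{30} $$

and for $s = j\omega$ we obtain

$$ A'(\delta) = \frac{1}{2} \begin{pmatrix} \dfrac{2\omega C(\omega RC - \delta)}{1 + \omega^2 R^2 C^2} & \dfrac{(1 - j\delta)\, g_m}{1 - j\omega RC} \\[2ex] \dfrac{(1 + j\delta)\, g_m}{1 + j\omega RC} & 2 g_d \end{pmatrix}. \tag{31} $$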


Since $g_d > 0$, the determinant test for positive definiteness only requires examining
$\det A'$. Note that this was not the case in the reflection amplifier example. $\det A'$ is
quadratic in $\delta$. Solving for $\det A' = 0$ (the frequency at which $A'$ becomes positive definite)
leads to

$$ \sigma_{\max} = \frac{-1 \pm \sqrt{1 + (1 + \delta^2)\,\dfrac{g_m^2 R}{g_d}}}{2RC} \tag{32} $$

and

$$ \omega_{\max} = \frac{\delta \pm \sqrt{\delta^2 + \dfrac{g_m^2 R}{g_d}\,(1 + \delta^2)}}{2RC}. \tag{33} $$

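Optimizing with respect to $\delta$ to find the min-max values for $\sigma_{\max}$ and $\omega_{\max}$,

$$ \sigma_{\max} = \frac{-1 + \sqrt{1 + \dfrac{g_m^2 R}{g_d}}}{2RC} \tag{34} $$

and

$$ \omega_{\max} = \frac{g_m^2}{2 g_d C \sqrt{1 + \dfrac{g_m^2 R}{g_d}}}. \tag{35} $$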

Unlike in the reflection amplifier case, decomposing $Y$ into its eigenvalues and
eigenvectors is unproductive because the eigenvectors are nonorthogonal. The only useable
structure we have is that $W(Y)$ contains the spectrum of $Y$.
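
The boundary points for the simple transistor model can also be located numerically, which gives an independent check on the closed-form work. The sketch below assumes the admittance matrix of the Fig. 1a model, with the source as datum, is $Y(s) = \begin{pmatrix} sC/(sRC+1) & 0 \\ g_m/(sRC+1) & g_d \end{pmatrix}$, and it uses the Fig. 1a parameter values listed in Table 2. The function names, the grid over $\theta$, and the bisection bracket are assumptions of this illustration.

```python
import numpy as np

def y_mos(s, gm, gd, C, R):
    """Assumed admittance matrix of the Fig. 1a model, source as datum."""
    d = s * R * C + 1.0
    return np.array([[s * C / d, 0.0], [gm / d, gd]], dtype=complex)

def forbidden(Y, n_theta=4000):
    """True if A(theta) is positive definite for some theta on the grid."""
    theta = np.linspace(0.0, 2.0 * np.pi, n_theta, endpoint=False)
    # Stack of rotated matrices, one 2x2 matrix per theta.
    Yr = Y[None, :, :] * np.exp(1j * theta)[:, None, None]
    A = 0.5 * (Yr + np.conj(np.swapaxes(Yr, 1, 2)))
    # eigvalsh works on stacks; column 0 holds the smallest eigenvalue.
    return bool(np.linalg.eigvalsh(A)[:, 0].max() > 0.0)

def threshold(direction, lo=1e9, hi=1e13, iters=80, **p):
    """Bisect along s = direction * x for the permitted/forbidden boundary."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if forbidden(y_mos(direction * mid, **p)):
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

params = dict(gm=7e-4, gd=8e-5, C=1e-14, R=1000.0)  # Fig. 1a column, Table 2
sigma_max = threshold(1.0, **params)    # boundary on the real axis
omega_max = threshold(1.0j, **params)   # boundary on the imaginary axis
print(sigma_max, omega_max, omega_max / (2.0 * np.pi))
```

The printed values can be compared with the Fig. 1a column of Table 2. Small differences in the last digit are to be expected from rounding in the table and from the finite grid over $\theta$.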

The third example is an extension of the last. Here, realistic parasitics and a body
terminal are added to the model of Fig. 1a. The new model, which still does not take into
account distributed effects or source and drain resistances, is shown in Fig. 6. Since the
simple model of Fig. 1a is a special case of the model of Fig. 6, the two can be compared.


Parameter              Fig. 1a         Fig. 6          Trivial       Units

g_m                    7 × 10^-4       7 × 10^-4       7 × 10^-4     Ω^-1
g_mb                   0               2 × 10^-5       0             Ω^-1
C_gs                   10^-14          10^-14          10^-14        F
C_gd                   0               3 × 10^-15      0             F
C_bs                   0               6 × 10^-15      0             F
C_bd                   0               6 × 10^-15      0             F
C_gb                   0               10^-15          0             F
g_d                    8 × 10^-5       8 × 10^-5       0             Ω^-1
R_g                    1000            1000            0             Ω
ω_max                  1.14 × 10^11    2.95 × 10^10    ∞             s^-1
σ_max                  8.34 × 10^10    1.91 × 10^10    ∞             s^-1
f_max = ω_max/2π       18.1            4.69            ∞             GHz
f_τ                    11.1            7.95            11.1          GHz

Table 2

Table 2 gives the parameter values for the simplified (Fig. 1a) and full (Fig. 6) models.
The parameter values are typical of MOSFETs found in VLSI circuits. The Fig. 6 values of
$\sigma_{\max}$ and $\omega_{\max}$ were obtained by computer. Note the large differences in speed predicted
by the two models. For comparison, the transition frequencies $f_\tau$ of the two cases are also
given. $f_\tau$ is the frequency at which the magnitude of the output short-circuit
common-source current gain drops to 1. The $f_\tau$ is


$$ 2\pi f_\tau = \frac{g_m}{C_{gs} + C_{gd} + C_{gb}}. \tag{36} $$
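
As a worked check of (36), the short sketch below evaluates the transition frequency for the two transistor models using the parameter values of Table 2. The function name is an assumption of this illustration.

```python
import math

def f_tau(gm, cgs, cgd, cgb):
    """Transition frequency in Hz from 2*pi*f_tau = gm / (Cgs + Cgd + Cgb)."""
    return gm / (2.0 * math.pi * (cgs + cgd + cgb))

# Parameter values copied from Table 2.
simple = f_tau(7e-4, 1e-14, 0.0, 0.0)      # Fig. 1a model
full = f_tau(7e-4, 1e-14, 3e-15, 1e-15)    # Fig. 6 model
print(simple / 1e9, full / 1e9)            # both in GHz
```

The gate resistance never enters this calculation, which is the point made in the discussion that follows.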


Note, however, that the $f_\tau$ of a transistor, while of proven usefulness, is not fundamental
in the sense that $\omega_{\max}$ is fundamental, since it assumes a topology. This is why the series
gate resistor $R_g$ does not appear in (36) despite its obvious importance to most practical
circuits. The shortcomings of $f_\tau$ are accentuated in the fourth column of Table 2. While
the $f_\tau$ is finite, the maximum frequency of oscillation is infinite. A circuit which can realize
current gain at any frequency is illustrated in Fig. 7. The trick to this circuit is that the
gate capacitances are added in series while the drain currents are added in parallel. By
using many transistors one can make the effective $f_\tau$ as high as desired. The fourth column
represents an exceptional case. More often, as in column three, the $f_\tau$ is too optimistic a
predictor of circuit performance.

Since $\omega_{\max}$ includes the effects of all parasitics, I believe that it will find great utility
in the comparison and development of competing technologies, e.g., GaAs MOSFETs and
Si bipolar transistors.

Figures

1  Simple small signal model of a MOS transistor (a) and the permitted and forbidden
natural frequencies (b) of networks built out of resistors, capacitors, and these
transistors ($R = 0$).

2  The  collection  of  n-port  components  which  comprise  S,  the  basic  building  block  used 
to  construct  the  circuits  M  considered  in  the  analysis. 

3  The  numerical  range  of  Y.  In  (a)  the  Hermitian  and  antiHermitian  parts  of  Y  are 
indefinite  yet  zero  is  not  in  the  numerical  range.  In  the  rotated  version  (b)  the 
Hermitian  part  of  Y  is  positive  definite. 

4  A  negative  resistance  amplifier  (a)  and  other  permitted  components  (b). 

5  The evolution of the numerical range of the circuitry in Fig. 4 as $\omega$ increases (a) and
the numerical range of $A(\theta)$ (b).

6  A  first-order  MOS  transistor  model. 

7  A circuit to increase the effective $f_\tau$ of a transistor. The trick to this circuit is that
the capacitances are added in series while the currents are added in parallel.

ABSTRACT *

Many of the failure mechanisms which cause reliability
problems in VLSI chips can be influenced or avoided in
the circuit design phase. RELIC is a reliability simulator
developed to analyze and predict the stress and wear
on MOS VLSI chips due to such mechanisms. RELIC
uses a simple methodology for abstracting the idea of
the stress from any particular failure mechanism, thus
allowing analyses of many different failure mechanisms.
There are currently three failure mechanisms analyzed by
RELIC: metal migration, hot electron trapping, and time
dependent dielectric breakdown (TDDB).


1.  Introduction 

RELIC is a reliability simulator developed to analyze
the stress and wear on MOS VLSI chips. RELIC is designed
to help the circuit designer develop competitive
and reliable chips with minimal extra effort. By using
RELIC, chip designers can design, not only for worst-case
speed and power [Glasser 85], but also for worst-case
reliability. RELIC is not built around any particular failure
phenomenon or model; rather, RELIC exists as a system
in which many failure mechanisms can be simulated and
new models easily implemented.

RELIC simulates those failure mechanisms which are
under the control, or at least the influence, of the circuit
designer. For such failure mechanisms, there exist models
which relate the median time to failure (MTTF) of the
device to the operating voltages and currents of the circuit
and the actual layout of the chip. RELIC employs a
circuit simulator so that the voltages and currents used
in stress calculations are worst-case operating waveforms
and not just the maximum voltages or currents.

The idea of reliability simulation is not a new one and
software tools have been proposed to check circuit designs
for certain reliability hazards, such as metal migration or
hot electrons. A simulator to determine metal migration
was described by Kokkonen [Kokkonen 84]. Substrate
current circuit simulators have been described by Sing
[Sing 80] and Sakurai [Sakurai 85], from which predictions
about hot electron trapping can be made. Another
hot electron simulator has been proposed by C. T. Sah
[Iversen 86], who is currently developing a hot electron
model from a combination of theory and experimental
results.

While the simulation of failure mechanisms has been
proposed previously, there are two major differences between
earlier simulators and RELIC. First, RELIC calculates
stress and wear based on actual dynamic voltages
and currents determined by circuit simulation, and not
on the basis of worst-case static operating conditions. In
addition, RELIC is the first reliability simulator to run
failure tests for several mechanisms. RELIC provides a
structure in which existing models may be easily changed,
and new models implemented with moderate effort. This
is accomplished by the use of a simple methodology for
handling the stress and wear caused by many different
failure mechanisms.


2.  Stress  and  Wear 


One  of  the  unifying  concepts  used  in  RELIC  is  that 
when  a  device  accumulates  a  certain  amount  of  stress 
over  time  (wear),  it  fails.  This  stress  may  be  in  the  form 
of  trapped  charge  in  the  gate  oxide  causing  breakdown 
(TDDB),  or  in  the  form  of  a  transistor  threshold  shift 
causing circuit malfunction. Regardless of which particular
failure phenomenon is being tested, we can abstract
the  idea  of  stress,  and  determine  the  MTTF  of  the  device 
on  the  basis  of  stress  and  circuit  configuration  alone. 


RELIC predicts the reliability of the circuit by first calculating
the wear which each circuit device experiences
over one normal cycle of operation. We define a normal
cycle of operation to be the time it takes to run the circuitry
through one routine of whatever it does. The wear
w on each device is calculated as the integral of the stress s
over time:

\[ w(t) = \int_0^{t} s(t')\,dt'. \qquad (1) \]

We assume a deterministic point of view which says that
the amount of wear a device can stand until it fails is
the critical value of wear $w(T_{\mathrm{fail}})$. $w(T_{\mathrm{fail}})$ is a random
variable with a mean and variance which must be determined
from tests on fabricated devices. The critical value
of  wear  may  also  depend  on  how  and  where  a  device  is 
used  in  a  circuit.  For  example,  the  amount  of  hot  electron 
stress  a  device  can  endure  without  failure  depends  on  its 
position  and  function  in  the  overall  circuit. 


Once we know the critical value of wear (distribution)
and the stress rate, we can find the time-to-failure, $T_{\mathrm{fail}}$,
where

\[ T_{\mathrm{fail}} = \frac{w(T_{\mathrm{fail}})}{\langle s \rangle} \qquad (2) \]

and where the average stress, taken over one normal cycle of duration $T_c$, is

\[ \langle s \rangle = \frac{1}{T_c}\int_0^{T_c} s(t)\,dt. \qquad (3) \]

Note that we are making the assumption that the stress
rate, as determined for one normal cycle, will remain constant
for the life of the circuit.
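
As an illustration of equations (1) to (3), the sketch below integrates a sampled stress waveform over one normal cycle and extrapolates a time-to-failure under the constant-stress-rate assumption just stated. It is a minimal sketch and not RELIC's implementation: the function names, the trapezoidal rule, and the sample numbers are all assumptions chosen for illustration.

```python
def wear_over_cycle(times, stress):
    """Trapezoidal estimate of w = integral of s(t) dt over the sampled cycle."""
    wear = 0.0
    for k in range(1, len(times)):
        dt = times[k] - times[k - 1]
        wear += 0.5 * (stress[k] + stress[k - 1]) * dt
    return wear


def time_to_failure(times, stress, critical_wear):
    """T_fail = w_crit / <s>, where <s> is the cycle-averaged stress."""
    cycle = times[-1] - times[0]
    avg_stress = wear_over_cycle(times, stress) / cycle
    return critical_wear / avg_stress


# Hypothetical 100 ns cycle with a triangular stress pulse (arbitrary units).
t = [0.0, 25e-9, 50e-9, 100e-9]
s = [0.0, 2.0, 0.0, 0.0]
print(time_to_failure(t, s, critical_wear=1.0e3))
```

Because only the cycle average enters, a short intense pulse and a long mild one with the same integral give the same predicted lifetime in this simplified picture.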

Another important assumption here is that the stress,
s, is of the form

\[ s(t) = A(t)\,e^{bE} \qquad (4) \]


where A(t), b, and E are quantities whose form varies
for each failure model. E is assumed to be approximately
constant over time. These assumptions for s are valid
because the phenomena which we are considering all have
exponential character in energy space. (See example of
metal migration model later.) Given the fact that stress
s depends exponentially on E, if E is a normal random
variable then s is a log-normal variable.

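Substituting (3) and (4) into (2), and taking the log of
both sides, we then have

\[ \ln T_{\mathrm{fail}} = \ln w(T_{\mathrm{fail}}) - \ln\left( \frac{1}{T_c}\int_0^{T_c} A(t)\,e^{bE}\,dt \right) \qquad (5) \]

which, since E is approximately constant over time, reduces to

\[ \ln T_{\mathrm{fail}} = \ln w(T_{\mathrm{fail}}) - \ln\langle A \rangle - bE, \qquad (6) \]

where $\langle A \rangle$ is the average of A(t) over one normal cycle, $T_c$.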


Therefore, because E was a normal random variable, and
s was consequently log-normal, $T_{\mathrm{fail}}$, which depends
inversely on s, also has a log-normal probability distribution.
RELIC assumes log-normal distributions for all
failure mechanisms.
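
To make the last step explicit, suppose (as an illustrative assumption, not a statement from the text) that E is normal with mean $\mu_E$ and standard deviation $\sigma_E$, and that $w(T_{\mathrm{fail}})$ and $\langle A \rangle$ are treated as fixed. Equation (6) is then an affine function of E, so the log of the time-to-failure is itself normal:

```latex
\[
\ln T_{\mathrm{fail}} \sim \mathcal{N}\!\left(
  \ln w(T_{\mathrm{fail}}) - \ln\langle A \rangle - b\,\mu_E,\;
  b^{2}\sigma_E^{2}
\right),
\qquad
\operatorname{median}(T_{\mathrm{fail}})
  = \frac{w(T_{\mathrm{fail}})}{\langle A \rangle}\, e^{-b\,\mu_E}.
\]
```

Under these assumptions the spread of the log-lifetime is set entirely by $|b|\,\sigma_E$. If the critical wear were also random, its variance would add to this term.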

This generalized model of stress is applicable for all
of the failure models which are currently implemented in
RELIC. A, b, and E can be determined from model and
circuit parameters and circuit simulation. The critical
wear value $w(T_{\mathrm{fail}})$ must be determined experimentally
and, as mentioned earlier, could depend on a device's location
and function in the circuit.

Once RELIC has computed the time-to-failure distribution
for each device, it then combines the distributions
for each device to obtain the total failure distribution for
the entire circuit. Using the assumption that the failure
of one device has no effect on the probability of failure of
any other device, we then treat the failure probabilities
as being independent. Given that, the probability
of system failure $P_{\mathrm{sys,fail}}$ is $P_{\mathrm{sys,fail}} = 1 - P_{\mathrm{sys,work}}$, where
$P_{\mathrm{sys,work}}$ is the probability that the system has not failed.
Assume that one device failure is enough to cause the
system to fail. Given an independent system, $P_{\mathrm{sys,work}}$ is
equal to the product of the probabilities that each device
is working. Thus, $P_{\mathrm{sys,fail}}$ is found to be

\[ P_{\mathrm{sys,fail}} = 1 - \prod_i \left(1 - P_{\mathrm{fail},i}\right). \qquad (7) \]


In addition, if all of the individual probabilities for device
failure are small, and the cross products of (7) even
smaller, we may then use what is known as the Rare Event
Approximation. Using this approximation, the probability
of system failure becomes the linear sum of the individual
failure probabilities for each device. That is,

\[ P_{\mathrm{sys,fail}} \approx \sum_i P_{\mathrm{fail},i}. \qquad (8) \]

Note that this only works if the number of devices $n_{\mathrm{dev}}$
is small enough that $n_{\mathrm{dev}} \ll 1/P_{\mathrm{fail},i}$.
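
The gap between the exact product form and the Rare Event Approximation is easy to check numerically. The sketch below is illustrative only: the function names and the per-device probabilities are hypothetical, and device failures are assumed independent as in the text.

```python
def system_failure_exact(p_fail):
    """Equation (7): 1 - product of (1 - p_i) for independent devices."""
    p_work = 1.0
    for p in p_fail:
        p_work *= 1.0 - p
    return 1.0 - p_work


def system_failure_rare_event(p_fail):
    """Equation (8): linear sum of the individual failure probabilities."""
    return sum(p_fail)


# Hypothetical per-device failure probabilities.
probs = [1e-4, 2e-4, 5e-5]
exact = system_failure_exact(probs)
approx = system_failure_rare_event(probs)
print(exact, approx)
```

The sum always overestimates the exact value, and the error is of the order of the cross products, so the approximation degrades as the device count or the individual probabilities grow.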

3.  System  Structure  and  Implementation 


Figure  1.  Structure  of  the  RELIC  System 


The RELIC simulator consists of three parts: a preprocessor,
a modified circuit simulator, and a postprocessor,
as illustrated in Fig 1. We use the circuit simulator
both to simulate the circuit to determine voltage and
current waveforms and to calculate device stress based on
these conditions. The simulator in this implementation is
RELAX2.2 from the University of California at Berkeley
[White 85]. What we have done, basically, is to introduce
into RELAX some new device models, one for each failure
mechanism we are simulating.


Figure  2.  Reliability  Test  Structures 


These new models, which I shall refer to as reliability
test structures, are connected to the circuit nodes of
the device undergoing testing to measure its voltages and
currents, and from these operating conditions, along with
the device size and processing parameters, calculate the
instantaneous stress on that device [Fig 2]. Besides sensing
the circuit operating conditions, test structures also
employ circuit nodes connected to various configurations
of resistors and capacitors to calculate intermediate quantities.
A final node, the wear node, is used to output the
accumulated stress. This node is connected to a grounded
capacitor and is fed a current proportional to the instantaneous
stress, so that the voltage on this node represents
the total wear incurred.

The RELIC preprocessor, PREL, takes an input file
which describes the circuit according to RELAX syntax.
This input file must also contain some RELIC commands
indicating which devices to test, and for which failure
mechanisms. PREL modifies this file to include the appropriate
reliability test structures for each device undergoing
testing. PREL also instructs the circuit simulator
to output the voltage waveform for the new wear node.
This new circuit configuration is then simulated by the
circuit simulator for one normal cycle of operation and
the wear incurred for this cycle is output. Finally, the
RELIC postprocessor must compare this device wear information
with critical failure wear data to determine the
time-to-failure of individual devices and the MTTF for
the entire circuit. The postprocessor is under
development at the time of this writing.

4.  Failure  Models 

The current implementation of RELIC contains models
for metal migration [Kokkonen 84], hot electron trapping
[Hsu 82], and time dependent dielectric breakdown
[Chen 85]. Presented here are the salient features of the
metal migration model, which incorporates the effects of
I²R heating, thermal capacitance, and thermal resistance.
The wire stress depends on the wire temperature and current
waveforms.


Figure  3.  Implementation  of  the  Wire  Model 


The wire model is a 4-node reliability test structure.
Two of the nodes, N1 and N2, represent the ends of the
wire and are connected to the circuit nodes on either end
of the wire. (This feature is not well-supported by layout
extractors, which consider a wire to be one node in the
circuit; consequently, the locations of all wires the user
wishes to simulate must be identified in the PREL input
file.) RELIC models the electrical behavior of the wire
as a pi model, having the wire's resistance between N1
and N2 and half the wire capacitance from each of these
nodes to ground [Fig 3].

The wire model also employs a node, N3, for use in calculating
an intermediate quantity. The stress on the wire
is dependent on the wire temperature, which is a function
of the thermal capacitance and resistance. The thermal
equivalent RC circuit is also shown in [Fig 3], where the
voltage on N3 represents the change in temperature of the
wire.


The final node of the wire model is N4, which is the
wear node. The current source, SB, is proportional to
the instantaneous stress calculated by the metal migration
equations and, therefore, the voltage on N4 is proportional
to the amount of metal migration stress on the
wire.

The stress on the wire due to metal migration is roughly
expressed as

\[ s(t) \approx A\,J^{2}(t)\,e^{-E_a/kT} \qquad (9) \]

where A is related to the wire's physical size, J is the current
density through the wire, $E_a$ is the activation energy,
k is Boltzmann's constant, and T is the wire temperature.
Note that this is consistent with our assumption earlier
that the stress in our models was exponential in energy
space. This is also true in the hot electron and TDDB
models. Details on all of the equations and parameters
used in RELIC can be found in [Hohol 86].

The hot electron and TDDB models are implemented
in basically the same way as the metal migration model.
Both of these reliability test structures have sensing nodes
which connect in parallel to the source, gate, drain, and
bulk nodes of the transistor being tested. Because these
test structures are inserted in parallel with the circuit
device being tested, the user does not have to specify
additional nodes in the circuit (unlike the wire model,
which is inserted serially and requires two nodes instead
of one).

5. Results: An Example


Figure 4. IBM Bootstrapped Superbuffer, Stressed


RELIC was used to analyze one version of an IBM
Bootstrapped Superbuffer, shown in Fig 4. This circuit
uses two stages of bootstrapping in order to drive
a large capacitive load. However, the result of this double
bootstrapping is a large amount of hot electron and
TDDB stress on transistor M1.


Figure 5. Hot Electron and TDDB Wear on M1


When the input N1 to the superbuffer is low and the
bootstrap action has been completed, the source and gate
of transistor M1 are at 0 V and the drain N2 is around
12.5 V. This large voltage differential across the oxide between
the gate N1 and the drain creates large amounts
of TDDB stress and wear (see Fig 5). When the input
N1 to the superbuffer rises again, transistor M1 turns
on, with an initial drain-to-source voltage of 12.5 V. This
large saturation voltage generates hot electron stress and
wear on the gate oxide of M1 (also in Fig 5).


Figure  6.  Metal  Migration  Wear  on  Output  Wire 


In order to test RELIC on a metal migration simulation,
I added a 2000 µm long wire with a width of 5 µm to
the output node N3 of the superbuffer. As this wire was
charged and discharged, the currents through the metal
were found to cause metal migration wear. This wear is
shown on the plot in Fig 6. Note that the metal migration
wear increases and decreases as the current in the
wire changes direction, but that the net wear appears to
be in the negative direction. This does not mean that
there is negative wear on the wire, but that the wear
caused by the wire discharging is greater than the wear
endured during charging. This is because the wire discharges
faster than it charges, and consequently the currents
during this period are greater in magnitude and
cause more stress according to our simple model.
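
The sign argument can be made concrete with a simplified calculation. It assumes, purely for illustration, that the wear node accumulates a signed stress proportional to $J|J|$ (so that the direction of current flow is preserved), that temperature effects are ignored, and that charging and discharging are rectangular pulses of current density $J_c$ and $J_d$ lasting $t_c$ and $t_d$. None of these symbols appear in the original. Since the same charge is moved in both directions, $J_c t_c = J_d t_d = q$, and the net wear over one cycle is

```latex
\[
w_{\mathrm{cycle}} \;\propto\; J_c^{2}\,t_c - J_d^{2}\,t_d
  \;=\; q\,(J_c - J_d) \;<\; 0
  \quad\text{whenever } J_d > J_c .
\]
```

Under these assumptions a faster discharge (larger $J_d$, shorter $t_d$) always tips the net wear negative, which matches the qualitative behavior described for Fig 6.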


Figure  7.  IBM  Bootstrapped  Superbuffer,  Modified 


Figure  8.  Wear  Plots  of  Modified  Superbuffer 


An alternative IBM Bootstrapped Superbuffer is shown
in Fig 7. This circuit has an added transistor M2,
which protects the drain of M1 from voltages higher than
$V_{dd} - V_{th}$ (see Fig 8). The 12.5 V voltage drop is now
spread across two transistors instead of one. Although
the drain of M2 is still at 12.5 V, the gate of that transistor
is at 5 V, which reduces the voltage across the gate oxide
to 7.5 V, which does not register any TDDB stress for the
oxide thickness of 125 Å. Similarly, the approximately 8 V
voltage drop from the drain to the source of M2 does not
register any hot electron effects (see Fig 8). In order
to reduce the metal migration stress, I increased the wire
width to 10 µm. This causes the current density in the
wire to decrease, and consequently, no metal migration
wear is measured (also in Fig 8).

6.  Conclusions 

In this paper we have presented RELIC, a reliability
simulator for determining stress and wear on integrated
circuits. RELIC employs a unifying strategy for abstracting
stress from individual failure mechanisms, and
therefore allows for the analysis of many different failure
mechanisms. The heart of the RELIC simulator is
a circuit simulator, employed both to determine instantaneous
voltage and current waveforms and to evaluate
the equations for calculating stress due to failure
mechanisms. The analyses carried out by RELIC on two
versions of the IBM Bootstrapped Superbuffer show how
this tool may be used by circuit designers to both identify
problems and verify their correction.

The first formal M.I.T. course on transistors was taught in 1953. Despite this early
start,  for  most  of  the  70’s  the  Institute  maintained  only  a  small  integrated  circuit 
research  effort.  In  1977  M.I.T.  made  a  strategic  decision  to  reemphasize  microsystem 
technology  and  since  that  time  the  size  of  this  program  has  increased  dramatically,  with 
new  faculty  being  hired  each  year.  The  program  is  now  large  and  vibrant,  spanning 
research  areas  from  nanometer  structures  to  multiprocessor  architectures.  One  of 
the  many  areas  which  has  received  recent  attention  is  electrical  issues  in  large  digital 
machines,  where  a  “large”  machine  is  one  whose  physical  dimensions  are  big  compared 
to  the  distance  a  bit  spans  as  it  speeds  across  the  machine.  Within  this  area,  five 
critical  topics  are  being  addressed:  I/O,  synchronization  (e.g.,  clocking),  power,  noise, 
and  reliability.  It  is  this  last  topic  which  is  addressed  in  the  accompanying  article. 

RELIC  is  our  first  attempt  to  design  a  simulation  program  which  enables  the 
engineer  to  design  high-performance  circuits  not  only  for  worst-case  speed,  power,  and 
noise  margin,  but  also  worst-case  reliability.  Our  program  is  the  first  to  support  the 
reliability  simulation  of  a  circuit  stressed  by  several  dynamic  reliability  hazards. 

As  process  and  device  technologies  advance,  the  constraints  that  must  be  dealt  with 
continually  become  more  complex  and  difficult.  Nevertheless,  this  complex  constraint 
space  is  today  reflected  in  the  circuit  domain  as  an  orthogonal  set  of  relatively  simple 
rules.  We  do  not  believe  that  this  simplicity  can  be  maintained.  In  the  future,  product 
competitiveness  will  be  determined,  in  part,  by  the  ability  of  the  circuit  designer  to 
design  systems  which  simultaneously  extract  the  maximum  performance  from  critical 
devices  while  avoiding  the  edge  of  complex-shaped  low-reliability  regimes.  This  will 
be  possible  only  with  the  use  of  high-quality  reliability  models  and  computer-aided 
design  tools.  RELIC  is  a  demonstration  of  the  sort  of  low-level  design  tools  which  will 
be  necessary.  (It  is  also  worthwhile  to  ask  about  higher-level  tools  which  aid  reliability 
design.) 

One of the limitations of RELIC is that reliability models are not sufficiently developed
to predict the failure rate of a chip, even given exact process and circuit models.
Nevertheless,  it  should  be  possible  to  compare  two  circuits  for  relative  reliability  and 
thereby  guide  the  design  of  high-reliability  parts.  There  is  a  second,  less  obvious, 
application.  One  commonly  used  technique  for  improving  the  reliability  of  a  part  is 
to  do  an  accelerated  burn-in  to  remove  the  weak  devices,  those  which  contribute  to 
infant  mortality.  It  is  not  always  clear,  however,  how  to  accelerate  the  stress  on  a  part 
because,  though  one  may  raise  the  power-supply  voltage  from  5  to  7  V,  the  voltages 
internal  to  the  chip  need  not  follow.  (Consider  voltages  controlled  by  current  mirrors 
or charge pumps.) One therefore needs to design for stressability. For high-reliability
applications one must be able to design a circuit so that accelerated burn-in can be
accomplished.  RELIC  is  suited  to  this  task. 

RELIC is a first-generation program. It is an experimental program written by
Miss Terry Hohol on top of another experimental program (RELAX 2.2). It is therefore not
surprising that while the program is operational, it is neither robust nor user friendly.
But we have learned many things from RELIC. A second-generation program, now
under development, will improve upon RELIC in five ways: (1) it will be based on a
more solid circuit simulation program; (2) the models will be improved based on our
better understanding of the literature; (3) the control structure will be modified so
that it is easy to see the long-term effects of transconductance, threshold, and leakage
degradation on circuit performance; (4) a post-processor will be added which predicts
failure rates and cumulative percent fails in terms of the "wear" simulation variables
(this means that one must model the MTTF and σ of as many as three statistical
populations: main, freak, and infant); and (5) we intend to make the second-generation
program more robust and hence usable by the community.

Assuming  that  we  can  accomplish  these  tasks,  a  simulation  program  is  still  only 
as  good  as  the  models.  Improved  models  are  desperately  needed.  Even  after  all 
these  years  of  research,  metal  migration  is  still  not  well  understood.  For  instance, 
one  can  find  in  the  literature  papers  that  predict  that  pulsed  operation  is  better,  and 
worse, than steady-state operation. Hot carrier models are in reasonable shape for
dc excitation but, again, the effects of trapping and de-trapping time constants on
dynamic stress are unclear. Time-dependent dielectric breakdown is not well modeled
even under dc excitation (from electric field data there appear to be at least two competing
mechanisms), and pulsed dynamics and interactions with hot-carrier stressing
are generally mysterious. Quality programs to do dynamic reliability simulation will
soon  exist.  It  is  hoped  that  the  reliability  physics  community  will  be  able  to  meet  the 
enormous  challenge  of  quantifying  the  dynamics  of  device  failure  under  stress  so  that 
these  programs  can  accurately  predict  system  reliability. 

Abstract


We describe an efficient technique for breaking symmetry in parallel. The
technique works especially well on rooted trees and on graphs with a small
maximum degree. In particular, we can find a maximal independent set on
a constant-degree graph in O(lg* n) time on an EREW PRAM using a linear
number of processors. We show how to apply this technique to construct more
efficient parallel algorithms for several problems, including coloring of planar
graphs and (Δ+1)-coloring of constant-degree graphs. We also prove lower
bounds for two related problems.


1  Introduction 

Some  problems  for  which  trivial  sequential  algorithms  exist  appear  to  be  much 
harder  to  solve  in  a  parallel  framework.  Therefore,  new  methods  are  needed  for 
design  of  efficient  parallel  algorithms.  A  known  example  of  a  problem  with  a  trivial 
sequential  algorithm  which  is  hard  to  solve  in  parallel,  is  the  problem  of  finding  a 
maximal  independent  set  in  a  graph  [18].  This  problem  was  shown  to  be  in  the  class 
NC  by  Karp  and  Wigderson  [12].  A  simple  randomized  algorithm  for  the  problem 
is  due  to  Luby  [15].  Recently  M.  Goldberg  and  Spencer  [9]  gave  a  deterministic 
algorithm  for  the  problem  that  runs  in  polylogarithmic  time  using  a  linear  number 
of  processors. 
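
For reference, the "trivial sequential algorithm" for a maximal independent set can be sketched as a single greedy scan. This is only the sequential baseline, not any of the parallel algorithms cited above; the adjacency-dictionary representation and the function name are assumptions made for the example.

```python
def greedy_mis(adj):
    """Sequential greedy maximal independent set.

    adj maps each vertex to an iterable of its neighbors (undirected graph).
    """
    chosen = set()
    blocked = set()
    for v in adj:
        if v not in blocked:
            chosen.add(v)          # v has no chosen neighbor, so take it
            blocked.add(v)
            blocked.update(adj[v])  # neighbors of v can no longer be taken
    return chosen


# Hypothetical example: a 5-cycle.
cycle5 = {0: [1, 4], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 0]}
print(greedy_mis(cycle5))
```

Each decision depends on all earlier ones, which is exactly the sequential dependency that makes the problem awkward to parallelize.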

The  study  of  the  maximal  independent  set  problem  shows  the  importance  of 
techniques  for  breaking  symmetry  in  parallel.  The  symmetry-breaking  comes  up 
in  many  other  parallel  algorithms  as  well.  In  many  cases,  however,  it  is  enough 
to  be  able  to  break  symmetry  in  special  kinds  of  graphs.  The  performance  of  the 
resulting  algorithm  improves  if  we  can  solve  the  special  case  of  symmetry-breaking 
more  efficiently. 

In this paper we present a technique for breaking symmetry. In particular, we
give an O(lg* n) time algorithm to 3-color a rooted tree. This technique can be
viewed as a generalization of the deterministic coin-flipping technique of Cole and
Vishkin [5]. To show the usefulness of our technique, we present the following
algorithms. All of the presented algorithms use a linear number of processors.

• For graphs whose maximum degree is a constant Δ, we give an O(Δ lg Δ lg* n)
algorithm for (Δ+1)-coloring and for finding a maximal independent set on
an EREW PRAM.

• We give an algorithm to 7-color a planar graph. This algorithm, and the
maximal independent set (for planar graphs) algorithm based on it, run in
O(lg n lg* n) time on a CRCW PRAM and in O(lg² n) time on an EREW
PRAM. We also give an O(lg³ n lg* n) CRCW algorithm to 5-color a planar
graph.

• We give an O(lg n lg* n) algorithm for finding a maximal matching in a planar
graph on a CRCW PRAM.

• For general graphs we give an O(Δ² lg n) algorithm for (Δ+1)-coloring and
for finding a maximal independent set on an EREW PRAM.

The above stated results improve the running time and processor bounds for the
respective problems. The fastest previously known algorithm for (Δ+1)-coloring
[15], in the case of constant-degree graphs, runs in O(lg n) time, and the deterministic
version of this algorithm requires n³ processors. The 5-coloring algorithm for
planar graphs, due to Boyar and Karloff [4], runs in O(lg³ n) time, and the deterministic
version of this algorithm requires n³ processors. The O(lg³ n) running time
of the maximal matching algorithm due to Israeli and Shiloach [10] can be reduced
to O(lg² n) in the restricted case of planar graphs, but our algorithm is faster.

Although in this paper we have limited ourselves to the application of our techniques
to the design of parallel algorithms for the PRAM model of computation,
the same techniques can be applied in a distributed model of computation [1,7].
Moreover, the Ω(lg* n) lower bound, given by Linial [14] for the maximal independent
set problem on a chain in the distributed model, shows that our symmetry-breaking
technique is optimal in this model.

The fact that a rooted tree can be 3-colored in O(lg* n) time raises the question
whether a rooted tree can be 2-colored within the same time complexity. We answer
this question by giving an Ω(lg n / lg lg n) lower bound for 2-coloring of a rooted tree.
We also present an Ω(lg n / lg lg n) lower bound for finding a maximal independent
set in a general graph, thus answering the question posed by Luby [15].

Some of the results presented here were obtained independently by Shannon [16].

2  Definitions  and  Notation 

This section describes the assumptions about the computational model, and introduces
the notation used throughout the paper. In this paper we use n to denote
the number of vertices and m to denote the number of edges in a graph. We use Δ
to denote the maximum degree of the graph.

Given a graph G = (V, E), we say that a subset of nodes I ⊆ V is independent if
no two nodes in I are adjacent. A coloring of a graph G is an assignment C : V → N
of positive integers (colors) to nodes of the graph. A coloring is valid if no two
adjacent nodes have the same color. The ith bit in the color of a node v is denoted
by C_v(i). A subset of edges M ⊆ E is a matching if any two distinct edges in M
have no nodes in common.

The following problems are discussed in the paper:

• The vertex-coloring (VC) problem: find a valid coloring of a given graph that
uses at most Δ+1 colors.

•  The  maximal  independent  set  (MIS)  problem:  find  a  maximal  independent 
set  of  vertices  in  a  given  graph. 

•  The  maximal  matching  (MM)  problem:  find  a  maximal  matching  in  a  given 
graph. 
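
To fix ideas about the VC and MM problems, the sketch below gives straightforward sequential solutions. These are baselines for comparison only, not the parallel algorithms of this paper; the function names and input representations are assumptions made for the example.

```python
def greedy_coloring(adj):
    """Give each vertex the smallest positive color unused by its neighbors.

    A vertex of degree d sees at most d used colors, so it never needs a
    color above d + 1; hence at most (max degree + 1) colors are used.
    """
    color = {}
    for v in adj:
        used = {color[u] for u in adj[v] if u in color}
        c = 1
        while c in used:
            c += 1
        color[v] = c
    return color


def greedy_maximal_matching(edges):
    """Scan the edges once, keeping any edge whose endpoints are both free."""
    matched = set()
    matching = []
    for u, v in edges:
        if u not in matched and v not in matched:
            matching.append((u, v))
            matched.update((u, v))
    return matching


# Hypothetical example: a 5-cycle (maximum degree 2).
adj5 = {0: [1, 4], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 0]}
edges5 = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
print(greedy_coloring(adj5))
print(greedy_maximal_matching(edges5))
```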

We  make  a  distinction  between  unrooted  and  rooted  trees.  In  a  rooted  tree,  each 
nonroot  node  knows  which  of  its  neighbors  is  its  parent. 


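The following notation is used:

\[
\begin{aligned}
\lg x &= \log_2 x \\
\lg^{(1)} x &= \lg x \\
\lg^{(i)} x &= \lg \lg^{(i-1)} x \\
\lg^{*} x &= \min\{\, i \mid \lg^{(i)} x \ \ldots
\end{aligned}
\]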
We assume a PRAM model of computation where each processor is capable
of executing simple word and bit operations. The word width is assumed to be
O(lg n). The word operations we use include bit-wise boolean operations, integer
comparisons, and unary-to-binary conversion. In addition, we assume that each
processor has a unique identification number O(lg n) bits wide, which we denote by
PE-ID. We use exclusive-read, exclusive-write (EREW) PRAM, concurrent-read,
exclusive-write (CREW) PRAM, and concurrent-read, concurrent-write (CRCW)
PRAM as appropriate. All lower bounds are proven for a CRCW PRAM with a
polynomial number of processors.


3  Coloring  Rooted  Trees 


This section describes an O(lg* n) time algorithm for 3-coloring rooted trees. First
we describe an O(lg* n) time algorithm for 6-coloring rooted trees. Then we show
how to transform a 6-coloring of a rooted tree into a 3-coloring in constant time.

The procedure 6-Color-Rooted-Tree is shown in Figure 1. This procedure accepts
a rooted tree T = (V, E) and 6-colors it in time O(lg* n). Starting from the valid
coloring given by the processor IDs, the procedure iteratively reduces the number
of bits in the color descriptions by recoloring each non-root node v with the color
obtained by concatenating the index of a bit in which C_v differs from C_father(v) and
the value of this bit. The root r appends C_r(0) to 0.

Theorem 1 The algorithm 6-Color-Rooted-Tree produces a valid 6-coloring of a
tree in O(lg* n) time on a CREW PRAM using O(n) processors.

PROCEDURE 6-Color-Rooted-Tree
    L ← ⌈lg n⌉
    for all v ∈ V in parallel do C_v ← PE-ID(v)    ;;; initial coloring
    while L > ⌈lg L⌉ + 1 do
        for all v ∈ V in parallel do
            if v is the root
                then do
                    i_v ← 0
                    b_v ← C_v(0)
                end
                else do
                    i_v ← min{i | C_v(i) ≠ C_father(v)(i)}
                    b_v ← C_v(i_v)
                end
            C_v ← i_v b_v    ;;; concatenation of index and bit
        end
        L ← ⌈lg L⌉ + 1    ;;; number of bits in the new colors
    end


Figure  1:  The  Coloring  Algorithm  for  Rooted  Trees 
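
The procedure can be simulated sequentially to see the color ranges shrink. The sketch below is an illustration under stated assumptions, not the authors' implementation: nodes are numbered 0..n-1 and double as processor IDs, `parent[v]` is `None` for the root, the pair (index, bit) is encoded as the integer 2·index + bit, and one extra round is run once the colors fit in 3 bits, as in the proof that follows. Each call to `step` corresponds to one parallel round.

```python
def six_color_rooted_tree(parent):
    """Return a valid coloring with colors in 0..5 for a rooted tree."""
    n = len(parent)
    color = list(range(n))                 # unique IDs form a valid coloring
    bits = max(1, (n - 1).bit_length())    # bits needed to write any ID

    def step(old):
        new = [0] * n
        for v in range(n):                 # conceptually: for all v in parallel
            p = parent[v]
            if p is None:
                i = 0                      # the root always uses bit 0
            else:
                diff = old[v] ^ old[p]     # nonzero because old is valid
                i = (diff & -diff).bit_length() - 1  # lowest differing bit
            b = (old[v] >> i) & 1
            new[v] = 2 * i + b             # encodes the pair (index, bit)
        return new

    while bits > 3:
        color = step(color)
        bits = (bits - 1).bit_length() + 1  # ceil(lg bits) + 1
    return step(color)                      # final round: at most 6 colors


# Hypothetical example: a path of 100 nodes rooted at node 0.
chain = [None] + list(range(99))
print(sorted(set(six_color_rooted_tree(chain))))
```

The integer encoding 2·index + bit is just one convenient way to represent the concatenation; any injective encoding of the pair would preserve validity.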


Proof: First we prove by induction that the coloring computed by the algorithm is
valid, and then we prove the upper bound on the execution time.

Assume that the coloring C is valid at the beginning of an iteration; we show
that the coloring at the end of the iteration is also valid. Let v and w be two
adjacent nodes; without loss of generality assume that v is the father of w. By the
algorithm, w chooses some index i such that C_v(i) ≠ C_w(i) and v chooses some
index j such that C_v(j) ≠ C_father(v)(j) (or j = 0 if v is the root). The new color
of w is (i, C_w(i)) and the
new color of v is (j, C_v(j)). If i ≠ j, the new colors are different and we are done.
On the other hand, if i = j, then C_v(i) ≠ C_w(i) and again the colors are different.
Hence, the validity of the coloring is preserved.

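Now we show that the algorithm terminates after O(lg* n) iterations. Let $L_k$
denote the number of bits in the representation of colors after k iterations, where
$L = \lceil \lg n \rceil$ is the initial number of bits. For k = 1 we have

$$L_1 = \lceil \lg L \rceil + 1 \le 2 \lceil \lg L \rceil \quad \text{if } \lceil \lg L \rceil \ge 1.$$

Assume that for some k we have $L_{k-1} \le 2 \lceil \lg^{(k-1)} L \rceil$ and $\lceil \lg^{(k)} L \rceil \ge 2$. Then

$$L_k = \lceil \lg L_{k-1} \rceil + 1 \le \lceil \lg ( 2 \lceil \lg^{(k-1)} L \rceil ) \rceil + 1 \le 2 \lceil \lg^{(k)} L \rceil .$$

Therefore, as long as $\lceil \lg^{(k)} L \rceil \ge 2$,

$$L_k \le 2 \lceil \lg^{(k)} L \rceil .$$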

Hence, the number of bits in the representation of colors, L_k, decreases until, after
O(lg* n) iterations, ⌈lg^(k) L⌉ becomes 1 and L_k reaches the value of 3 (the solution of
L = ⌈lg L⌉ + 1). Another iteration of the algorithm produces a 6-coloring: 3 possible
values of the index i_v and 2 possible values of the bit b_v. The algorithm terminates
at this point.

We use concurrent-read capability to broadcast the newly computed color C_v to
all the sons of v; no concurrent-write capabilities are required. For constant-degree
trees the concurrent-read capability is not needed either. ∎
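To get a feel for how quickly the bit count in the proof above collapses, the recurrence L_k = ⌈lg L_{k-1}⌉ + 1 can be iterated directly. The helper name `bit_lengths` is hypothetical; the sketch simply lists the successive bit counts, starting from L_0 = ⌈lg n⌉, until the fixed point 3 is reached.

```python
def bit_lengths(n):
    """Return [L_0, L_1, ...] with L_0 = ceil(lg n) and L_k = ceil(lg L_{k-1}) + 1."""
    L = max(1, (n - 1).bit_length())   # ceil(lg n) for n >= 2
    seq = [L]
    while L > 3:                       # 3 is the solution of L = ceil(lg L) + 1
        L = (L - 1).bit_length() + 1
        seq.append(L)
    return seq


print(bit_lengths(2 ** 64))   # [64, 7, 4, 3]
```

Even for n = 2^64 only three reductions are needed before the final pass, which is the lg* behaviour the proof establishes.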

As we have shown, a rooted tree can be 6-colored quickly. A natural question
to ask at this point is whether one can use fewer colors and still stay within the same
complexity bounds. The following theorem answers this question.

Theorem 2 A rooted tree can be 3-colored in O(lg* n) CREW PRAM time using
O(n) processors.

Proof: The algorithm 3-Color-Rooted-Tree presented in Figure 2 starts by using the
previously described algorithm to 6-color the tree and then recolors it in 3 colors in
constant time.

The algorithm recolors the nodes colored with bad colors 3, 4, and 5 into good
colors 0, 1, 2 as follows. First, each node is recolored in the color of its father, so
that any two nodes with the same father have the same color. The root, which
has no father, recolors itself with a good color different from its current color. Next,
the algorithm removes the color from every node that has a bad color and has a
neighbor with a good color. These nodes become uncolored. Every node v that still
has a color C_v is recolored in the color C_v mod 3; this gets rid of the remaining bad
colors. Note that this coloring has the nice property that for any node v, all of the
sons of v that are colored must be colored identically.

The resulting coloring is valid, but not all nodes are colored. By construction,
every uncolored node has at least one colored neighbor. Therefore, if there are
two nodes v and w such that v = father(w) and both nodes are uncolored, then
father(v) is colored and sons(w) are colored too. The algorithm colors v with a
color different from C_sons(v) and from C_father(v). Such a color always exists because
there are 3 different colors to choose from and all the colored sons of v are colored
with the same color. Finally, the algorithm colors w with a color different from both
C_v and C_sons(w). Every step of the 3-Color-Rooted-Tree algorithm can be executed
in constant time except for the first one, in which we color the tree with 6 colors.
Hence, the total running time of the algorithm is O(lg* n). ∎
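The constant-time recoloring described in the proof can be sketched as a sequential simulation. The function name `three_color_from_six`, the `parent` array encoding, and the convention that colors 0..5 are plain integers are assumptions made for illustration; each group of nodes is processed by reading all colors before writing any, which imitates one parallel step.

```python
def three_color_from_six(parent, color6):
    """Turn a valid 6-coloring (colors 0..5) of a rooted tree into a 3-coloring.

    parent[v] is the father of v, or None for the root.
    """
    n = len(parent)
    sons = [[] for _ in range(n)]
    for v, p in enumerate(parent):
        if p is not None:
            sons[p].append(v)

    def neighbors(v):
        return sons[v] + ([] if parent[v] is None else [parent[v]])

    # Shift colors down: every node takes its father's color; the root takes
    # a good color different from its old one (which its sons now hold).
    c = [min({0, 1, 2} - {color6[v]}) if p is None else color6[p]
         for v, p in enumerate(parent)]

    # Bad-colored nodes (3, 4, 5) with a good-colored neighbor lose their color;
    # every other node reduces its color mod 3.
    uncolored = {v for v in range(n)
                 if c[v] > 2 and any(c[w] <= 2 for w in neighbors(v))}
    c = [None if v in uncolored else c[v] % 3 for v in range(n)]

    def free_color(v):
        used = {c[w] for w in neighbors(v) if c[w] is not None}
        return min({0, 1, 2} - used)

    # First the uncolored nodes whose father is colored, then the remaining ones.
    first = [v for v in uncolored if parent[v] not in uncolored]
    second = [v for v in uncolored if parent[v] in uncolored]
    for group in (first, second):
        picks = {v: free_color(v) for v in group}   # all reads before any write
        for v, col in picks.items():
            c[v] = col
    return c
```

The two groups correspond to the nodes v and w in the proof: no two nodes in the same group are adjacent, so their choices cannot conflict.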

Any  tree  can  be  2-colored.  In  fact,  it  is  easy  to  2-color  a  tree  in  polylogarithmic 
time.  For  example,  one  can  use  treefix  operations  [13]  to  compute  the  distance  from 
each  node  to  the  root,  and  color  even  level  nodes  with  one  color  and  odd  level  nodes 
with  the  other  color.  It  is  harder  to  find  a  2-coloring  of  a  rooted  tree  in  parallel, 
however,  than  it  is  to  find  a  3-coloring  of  a  rooted  tree.  In  section  7  we  show  a  lower 
bound of Ω(lg n / lg lg n) on 2-coloring of a directed list by a CRCW PRAM with a
polynomial  number  of  processors,  which  implies  the  same  lower  bound  for  rooted 
trees. 


4  Coloring  Constant-Degree  Graphs 


The method for coloring rooted trees described in the previous section is a
generalization of the deterministic coin-flipping technique described in [5]. The method
can be generalized even further [8] to color constant-degree graphs in a constant
PROCEDURE 3-Color-Rooted-Tree
    C ← 6-Color-Rooted-Tree(V, E)
    for all v ∈ V, v ≠ root in parallel do
        C_v ← C_father(v)
    end
    C_root ← min{ {0, 1, 2} − {C_sons(root)} }
    V_1 ← {v | C_v ≤ 2}
    V_2 ← V − V_1
    V' ← {v | v ∈ V_2 and ∃(v,w) ∈ E, w ∈ V_1}    ;;; bad-colored nodes with good-colored neighbors
    for all v ∈ V − V' in parallel do
        C_v ← C_v mod 3
    end
    for all v ∈ V' in parallel do
        C_v ← uncolored
    end
    for all v ∈ V' in parallel do
        if father(v) ∉ V'
        then do
            C_v ← min{ {0, 1, 2} − {C_sons(v)} − {C_father(v)} }
            V' ← V' − v
        end
    end
    for all v ∈ V' in parallel do
        C_v ← min{ {0, 1, 2} − {C_sons(v)} − {C_father(v)} }
    end

Figure  2:  The  3-coloring  Algorithm  for  Rooted  Trees 
number of colors. In the generalized algorithm, the current color of a node is replaced
by a new color obtained by looking at each neighbor, appending the index of a bit
in which the current color of the node is different from the neighbor's color to the
value of the bit in the node color, and concatenating the resulting strings. This
algorithm runs in O(lg* n) time, but the number of colors, although constant, is
exponential in the degree of the graph.

In this section we show how to use the procedure 3-Color-Rooted-Tree described
in the previous section to color a constant-degree graph with (Δ + 1) colors, where
Δ is the maximum degree of the graph.

First, we describe how to find, in constant time, a forest in a given graph such that
each node with nonzero degree in the graph has nonzero degree in the forest. The
removal of the edges of the forest decreases the maximum degree of the remaining
graph (unless the maximum degree of the graph is zero). We shall use this property
later to decompose the edges into Δ sets, each set inducing a forest on the nodes
of the graph. The procedure Find-Forest (see Figure 3) constructs such a forest.

The procedure has two steps. In the first step each node compares the IDs of
its neighbors with its own ID. A node that does not have the maximum processor
ID among its neighbors chooses an edge that connects it to the neighbor with the
largest processor ID. The graph induced by the chosen edges is a forest (the graph
has no cycles) and the nodes with the highest processor IDs among their neighbors
(the local maxima) are roots of the forest. In the second step each root with no
sons chooses an edge that connects it to one of its neighbors. The roots are local
maxima and are therefore independent. Hence, no new cycles are introduced into
the graph induced by the chosen edges.
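The two steps of Find-Forest can be sketched sequentially as follows. The function name `find_forest`, the representation of edges as `frozenset` pairs, and the use of node labels as processor IDs are assumptions of this sketch. In the second step the text allows any neighbor; the largest one is chosen here only to make the result deterministic.

```python
def find_forest(nodes, edges):
    """Return a set of forest edges touching every node of nonzero degree.

    edges is a set of frozenset({u, v}); node labels double as processor IDs.
    """
    adj = {v: set() for v in nodes}
    for e in edges:
        u, v = tuple(e)
        adj[u].add(v)
        adj[v].add(u)

    forest, roots = set(), set()
    # First step: every node that is not a local maximum picks the edge
    # to its neighbor with the largest ID.
    for v in nodes:
        if adj[v] and max(adj[v]) > v:
            forest.add(frozenset((v, max(adj[v]))))
        else:
            roots.add(v)

    # Second step: a root with no sons (no chosen edge touches it) picks an
    # edge to one of its neighbors, so that no zero-depth tree remains.
    touched = {v for e in forest for v in e}
    for v in roots:
        if v not in touched and adj[v]:
            forest.add(frozenset((v, max(adj[v]))))
    return forest
```

First-step edges always lead to a strictly larger ID, so they cannot close a cycle; each second-step edge attaches a node that had no forest edge at all.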

The algorithm Color-Constant-Degree-Graph, which colors a constant-degree graph
with (Δ + 1) colors, is presented in Figure 4. The algorithm consists of two phases.
In the first phase we iteratively call the Find-Forest procedure, each time removing
the edges of the constructed forest. This phase continues until no edges remain, at
which point we color all the nodes with one color.

In  the  second  phase  we  iteratively  return  the  edges  of  the  forests  into  the  graph, 
PROCEDURE Find-Forest(V, E)
    E' ← ∅
    R ← ∅
    for all v ∈ V in parallel do    ;;; construct the forest - the first step
        if PE-ID(v) is not a local maximum
        then do
            e_v ← (v, w) s.t. (v, w) ∈ E and PE-ID(w) = max{PE-ID(u) | (v, u) ∈ E}
            E' ← E' ∪ e_v
        end
        else do
            R ← R ∪ v
        end
    end
    for all v ∈ R in parallel do    ;;; get rid of zero-depth trees - the second step
        if ∄(v, w) ∈ E' and ∃(v, w') ∈ E
        then do
            E' ← E' ∪ (v, w')
        end
    end
    return (E')    ;;; the edges of the forest


Figure  3:  The  Spanning  Forest  Algorithm 
PROCEDURE Color-Constant-Degree-Graph
    E' ← E
    i ← 0
    while E' ≠ ∅ do    ;;; the first phase
        E_i ← Find-Forest(V, E')
        E' ← E' − E_i
        i ← i + 1
    end
    for all v ∈ V in parallel do    ;;; initial coloring
        C(v) ← 1
    end
    for i ← i − 1 to 0 do    ;;; the second phase
        C' ← 3-Color-Rooted-Tree(V, E_i)
        E' ← E' + E_i
        for k ← 1 to 3 do
            for j ← 1 to Δ + 1 do
                V' ← V
                for all v ∈ V' in parallel do
                    if C(v) = j and C'(v) = k
                    then do
                        C(v) ← max{ {1, 2, ..., Δ + 1} − {C(w) | (v, w) ∈ E'} }
                        V' ← V' − v
                    end
                end
            end
        end
    end


Figure  4:  The  Recoloring  Algorithm  for  Constant  Degree  Graphs 
each time recoloring the nodes to maintain a consistent coloring. At the beginning
of each iteration of this phase, the edges of the current forest (E') are added, making
the existing (Δ + 1)-coloring inconsistent. This forest is colored with 3 colors using
the 3-Color-Rooted-Tree procedure. Now, each node has two colors: one from the
coloring at the previous iteration and one from the coloring of the forest. The
pairs of colors form a valid 3(Δ + 1)-coloring of the graph. The iteration finishes by
enumerating the color classes, recoloring each node of the current color with a color
from {0, ..., Δ} that is different from the colors of its neighbors (note that we can
recolor all the nodes of the same color in parallel because they are independent).
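The class-by-class recoloring at the end of each iteration can be sketched on its own. The helper `reduce_colors` is hypothetical: it takes any valid coloring (for instance the pairs of colors mentioned above) and an adjacency map, and recolors one color class at a time with colors from {0, ..., Δ}. Within a class, all choices are computed before any is written, which stands in for the parallel step.

```python
def reduce_colors(adj, coloring, max_degree):
    """Recolor a validly colored graph with colors 0..max_degree.

    adj maps each node to the set of its neighbors; coloring maps each node
    to an arbitrary (sortable) color label such as a pair of colors.
    """
    c = dict(coloring)
    palette = set(range(max_degree + 1))
    for cls in sorted(set(coloring.values())):
        # Nodes of one original class are pairwise non-adjacent.
        members = [v for v in adj if coloring[v] == cls]
        picks = {}
        for v in members:
            used = {c[w] for w in adj[v]}   # at most max_degree labels
            picks[v] = min(palette - used)
        c.update(picks)
    return c
```

A node has at most Δ neighbors, so at least one of the Δ + 1 palette colors is always free; unprocessed neighbors still carry their old labels, which are avoided as well.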


Theorem 3 The algorithm Color-Constant-Degree-Graph runs in O(Δ lg Δ (Δ +
lg* n)) time and colors the graph with (Δ + 1) colors.


Proof :  At  each  iteration  all  edges  of  the  spanning  forest  are  removed.  From  the 
above  discussion  it  follows  that  each  node  that  still  has  neighbors  in  the  beginning 
of  an  iteration,  has  at  least  one  edge  removed  during  that  iteration,  and  therefore 
its degree decreases. Hence, the first phase of the algorithm terminates in at most
Δ iterations.

The second phase terminates in at most Δ iterations as well. Each iteration
consists of two stages. First, the current forest is colored using procedure
3-Color-Rooted-Tree, which takes, by Theorem 2, O(lg Δ lg* n) time on an EREW PRAM
(the lg Δ factor appears because we do not use the concurrent-read capability).
Now we iterate over all the colors. Since in this section we assume that Δ is a
constant, each iteration can be done in O(lg Δ) time using word operations. Hence,
one iteration of the second phase takes O(lg Δ lg* n + Δ lg Δ) time, leading to an
overall O(Δ lg Δ (Δ + lg* n)) running time on an EREW PRAM. ∎

Having a (Δ + 1)-coloring of a graph enables us to find an MIS in this graph. The
following theorem states this fact formally. (We refer to the algorithm described in
the proof as Constant-Degree-MIS in the subsequent sections.)
Theorem 4 An MIS in constant-degree graphs can be found in O(lg* n) time on an
EREW PRAM using O(n) processors.

Proof: After coloring the graph in a constant number of colors using the procedure
Color-Constant-Degree-Graph, one can find an MIS by iterating over the colors,
taking all the remaining nodes of the current color, adding them to the independent
set, and removing them and all their neighbors from the graph. By Theorem 3, the
coloring of a constant-degree graph takes O(lg* n) time on an EREW PRAM. The
selection of all nodes with a specific color and the removal of all neighbors of the
selected nodes takes constant time. ∎
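The iteration over color classes in this proof translates almost directly into code. The following is a minimal sequential sketch; the name `mis_from_coloring` and the adjacency-map representation are assumptions made here.

```python
def mis_from_coloring(adj, color):
    """Build a maximal independent set from a valid coloring.

    adj maps each node to the set of its neighbors; color maps nodes to colors.
    """
    alive = set(adj)
    mis = set()
    for cls in sorted(set(color.values())):
        chosen = {v for v in alive if color[v] == cls}  # independent: same color
        mis |= chosen
        alive -= chosen
        for v in chosen:
            alive -= adj[v]                             # drop their neighbors
    return mis
```

Every node is either selected or removed as a neighbor of a selected node, so the resulting set is maximal.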

The proofs of Theorems 3 and 4 also imply that the algorithms
Color-Constant-Degree-Graph and Constant-Degree-MIS have polylogarithmic running times for
graphs with polylogarithmic maximum degrees. However, in this case the
assumption that the word size is greater than Δ is unreasonable, so the running time of
the algorithms becomes O(Δ(Δ^2 + lg Δ lg* n)). In section 6 we present an algorithm
with better performance for Δ = ω(lg n).

The above algorithms can be implemented in the distributed model of
computation [1,7], where processors have fixed connections determined by the input
graph. The algorithms in the distributed model achieve the same O(lg* n) bound
as in the EREW PRAM model. Linial has recently shown [14] that Ω(lg* n) time
is required in the distributed model to find a maximal independent set on a chain.
Our algorithms are therefore optimal (to within a constant factor) in the distributed
model.


5  Algorithms  for  Planar  Graphs 


Any  planar  graph  can  be  4-colored.  However,  linear  time  sequential  algorithms  are 
known  only  for  5-coloring  planar  graphs.  In  this  section  we  describe  a  simple  and 
efficient  parallel  algorithm  that  7-colors  a  planar  graph,  and  show  how  to  construct 
a  more  complicated  parallel  algorithm  to  5-color  a  planar  graph. 
PROCEDURE 7-Color-Planar-Graph
    V' ← V
    V_1, V_2, ..., V_{lg n} ← ∅
    i ← 0
    while V' ≠ ∅ for all v ∈ V' do in parallel    ;;; first stage
        if Degree(v) ≤ 6
        then do
            V_i ← V_i + v
            V' ← V' − v
        end
        i ← i + 1
    end
    for i ← i − 1 to 0 do    ;;; second stage
        while V_i ≠ ∅ do
            E_i ← {(v, w) | v, w ∈ V_i ; (v, w) ∈ E}
            I ← Constant-Degree-MIS(V_i, E_i)
            for all v ∈ I do in parallel
                C_v ← max{ {1 ... 7} − {C_w | w ∈ V' ; (v, w) ∈ E} }
            end
            V' ← V' + I
            V_i ← V_i − I
        end
    end


Figure  5:  The  7-Coloring  Algorithm  For  Planar  Graphs 
First we describe an algorithm for 7-coloring of planar graphs. The algorithm,
called 7-Color-Planar-Graph, is shown in Figure 5. The algorithm consists of two
stages. In the first stage, we iteratively partition the vertices of the graph into
layers. At each iteration we create a new layer consisting of all nodes of the graph
with degree 6 or less and delete these nodes from the graph.

The second stage returns the layers to the graph in the order opposite to the
order in which the layers were removed. After a layer is returned, it is 7-colored in
a way consistent with the coloring of the layers that have been returned and
colored in the previous iterations. Note that all the nodes of the returned layer have
a degree of at most 6 in the current graph.

The  layer  is  colored  by  iteratively  applying  the  Constant-Degree-MIS  procedure 
to  find  an  MIS  in  the  subgraph  induced  by  the  uncolored  nodes  of  the  layer,  and 
coloring  each  of  the  selected  nodes  in  a  color  different  from  its  colored  neighbors. 
Since  the  uncolored  nodes  have  a  degree  of  at  most  6  in  the  current  graph,  we  never 
need  more  than  7  colors. 
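The layering idea can be sketched sequentially. Two assumptions are worth flagging: the name `seven_color` is hypothetical, and within a returned layer the nodes are colored one at a time in a plain greedy pass, which replaces the repeated Constant-Degree-MIS calls of the parallel algorithm. The sketch therefore shows why 7 colors suffice, not how the parallel running time is obtained.

```python
def seven_color(adj):
    """Color a graph with colors 1..7 by peeling layers of degree <= 6.

    adj maps each node to the set of its neighbors.
    """
    remaining = set(adj)
    layers = []
    # First stage: repeatedly remove all nodes of degree <= 6.
    while remaining:
        layer = {v for v in remaining if len(adj[v] & remaining) <= 6}
        if not layer:
            raise ValueError("no node of degree <= 6; the graph is not planar")
        layers.append(layer)
        remaining -= layer
    # Second stage: return the layers in reverse order and color them.
    color = {}
    for layer in reversed(layers):
        for v in layer:
            used = {color[w] for w in adj[v] if w in color}  # at most 6 colors
            color[v] = min(set(range(1, 8)) - used)
    return color
```

When a node is colored, its already colored neighbors all lie in its own layer or in layers removed later, and there are at most 6 of them, so one of the 7 colors is always free.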

Theorem 5 The algorithm 7-Color-Planar-Graph runs in O(lg n lg* n) time on a
CRCW PRAM and in O(lg^2 n) time on an EREW PRAM.

Proof: In a planar graph, at least a constant fraction (1/7th) of nodes have a
degree less than or equal to 6, and therefore the first stage of the 7-Color-Planar-Graph
algorithm terminates in at most O(lg n) steps. At each step we have to identify
the nodes that have degree less than 7 in the remaining graph. This takes constant
time on a CRCW PRAM (assuming that if two or more processors simultaneously
write into some location, one of them will succeed) and O(lg n) time on an EREW
PRAM.

In the second stage all the uncolored nodes are of degree less than or equal to 6
and therefore, by Theorem 4, the procedure Constant-Degree-MIS finds, in O(lg* n)
time, an MIS in the graph induced by these nodes. By the definition of the maximal
independent set, when the algorithm colors the MIS, at least one uncolored neighbor
of each uncolored node becomes colored. Therefore the second part of the second
stage terminates in at most 7 iterations.

Since the first stage takes O(lg n) time on a CRCW PRAM and O(lg^2 n) time on
an EREW PRAM, and since each one of the O(lg n) iterations of the second stage is
dominated by a call to Constant-Degree-MIS, the total running time is O(lg n lg* n)
on a CRCW PRAM and O(lg^2 n) on an EREW PRAM. ∎

Remark: If, at each stage, instead of removing from the graph all the nodes with
degree at most 6, we remove all the nodes with degree less than or equal to the average
degree, the algorithm described above produces a correct result in polylogarithmic
time for any graph G such that the average degree of any node-induced subgraph
G' of G is polylogarithmic in the size of G'. This class contains many important
subclasses including graphs that are unions of a polylogarithmic number of planar
graphs (i.e. graphs with polylogarithmic thickness).

Our techniques together with the ideas presented in [4] can be used to construct
a deterministic O(log^3 n lg* n) time algorithm for 5-coloring a planar graph.

The 5-coloring algorithm has two stages. The first stage of the algorithm
partitions the graph into layers such that vertices in any layer are independent and
have degree of at most 6 in the graph induced by the vertices in its layer and the
higher numbered layers. The second stage of the algorithm adds layers one by one,
starting from the layer with the highest number, each time recoloring the graph
with 5 colors.

Before describing the second stage, we need the following definitions. Let G be a
partially colored graph and let c_1 and c_2 be two distinct colors. A color component
is a connected component of the subgraph of G induced by all vertices of color c_1 and
c_2. A color component flip is a recoloring of the color component that exchanges
colors c_1 and c_2. A color component flip does not affect the validity of the coloring.

We can proceed with the description of the second stage of the algorithm. After a
layer is added to the already colored graph, we first color all vertices that can be colored
without changing the existing coloring. This can be done in the same way as in
the 7-coloring algorithm. Now all 5 colors are represented among the neighbors of each
uncolored vertex. Since the uncolored vertices have degree of at most 6, the results
of [4] imply that for every uncolored vertex v there are two colors c_1 and c_2 such that
v has exactly one neighbor w_1 of color c_1 and exactly one neighbor w_2 of color c_2.
Furthermore, the vertices w_1 and w_2 belong to different color components induced
by colors c_1 and c_2. Flipping either one of these color components allows us to color v.
The problem is, however, that flipping both color components simultaneously does
not allow us to color v. We call such color components dependent.

Whereas Boyar and Karloff use randomness to deal with this problem, we use
our symmetry-breaking techniques as follows. For each pair of distinct colors c_1
and c_2, we construct color components induced by these colors. Then we construct
a dependency graph with vertices corresponding to the color components and edges
corresponding to the dependencies between the color components. Flipping a set of
color components that corresponds to an independent set in the dependency graph
does not cause conflicts. Suppose we can find an independent set in the dependency
graph such that flipping the corresponding set of color components allows us to color
a constant fraction of the uncolored vertices. Then in O(log n) iterations we will be able to color
all uncolored vertices.

We find such an independent set in the dependency graph as follows. Observe
that the dependency graph is planar, so we can 7-color this graph using the
7-Color-Planar-Graph algorithm. Then, for each pair of distinct colors and for each
color class of the corresponding dependency graph, we compute the number of
uncolored vertices of the original graph which can be colored if the color components
corresponding to vertices in the color class are flipped. For each of the 10 possible
choices of colors c_1 and c_2 there are 7 color classes, so the total number of times
that we count the number of vertices that can be colored if a color class is flipped
is 70. Since each uncolored vertex is counted at least once, there is a color class
such that flipping all color components in this class allows us to color at least 1/70
of the uncolored vertices.

Next we analyze the complexity of the algorithm. The outer loop of the
algorithm that iterates over layers is executed O(log n) times, and the inner loop that
colors a constant fraction of uncolored vertices is executed O(log n) times as well.
Each iteration of the inner loop does 10 connected component computations, 70
enumerations, and 10 calls to the 7-Color-Planar-Graph procedure. Since each
connected component computation can be done in O(log n) time on a CRCW PRAM
using the Shiloach-Vishkin algorithm [17], the 7-Color-Planar-Graph procedure is the
bottleneck of the inner loop (recall that it runs in O(log n lg* n) time). The overall
running time of the algorithm is O(log^3 n lg* n).

The  above  result  is  summarized  in  the  following  theorem. 


Theorem 6 A planar graph can be 5-colored in O(lg^3 n lg* n) time on a CRCW
PRAM using O(n) processors.


Using the techniques described in this paper it is easy to construct a fast
algorithm for finding a maximal matching in a planar graph.


Theorem 7 A maximal matching in a planar graph can be found in O(lg n lg* n) time
on a CRCW PRAM.


Proof: First, the algorithm partitions the graph into layers, such that the nodes
in a layer are of degree less than 7 in the graph induced by the nodes of this
layer and the nodes in the higher-numbered layers. The algorithm proceeds by
iteratively returning a layer, finding a maximal matching in the obtained graph,
and removing the end-points of the edges in the matching. At the end of each
iteration the remaining nodes induce a graph of degree zero and therefore at the
beginning of each iteration the maximum degree of the induced graph is 6. Hence, a
maximal matching in this graph can be found in O(lg* n) time by finding a maximal
independent set in the line-graph, which also has a constant maximum degree. Each
iteration takes O(lg* n) time on a CRCW PRAM and the number of iterations is
O(lg n). This gives O(lg n lg* n) total running time. ∎
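The key step of this proof, obtaining a maximal matching from a maximal independent set in the line-graph, can be sketched as follows. The function name `maximal_matching` is hypothetical, and a simple sequential greedy scan stands in for the Constant-Degree-MIS procedure used by the parallel algorithm.

```python
def maximal_matching(edges):
    """Find a maximal matching as a maximal independent set of the line-graph.

    edges is an iterable of (u, v) pairs without duplicates.
    """
    edges = [tuple(e) for e in edges]
    # Line-graph: two edges are adjacent iff they share an end-point.
    line_adj = {e: {f for f in edges if f != e and set(e) & set(f)}
                for e in edges}
    matching, blocked = [], set()
    for e in edges:                       # greedy MIS in the line-graph
        if e not in blocked:
            matching.append(e)
            blocked |= line_adj[e] | {e}
    return matching
```

An independent set of the line-graph is a set of edges with no shared end-points, and its maximality means that every other edge touches a matched node.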
6  Coloring  Polylogarithmic  Degree  Graphs


This section describes a coloring algorithm for graphs with maximum degree which
is polylogarithmic in the size of the graph. For Δ = ω(lg n), this algorithm has
a better performance than the algorithm Color-Constant-Degree-Graph described
above.

The Poly-Log-Color algorithm is shown in Figure 6 and works as follows. First,
the graph is partitioned into two subgraphs with approximately equal numbers of
nodes, and the subgraphs are recursively colored in Δ + 1 colors. Then we iterate
through all the colors of one of the subgraphs, recoloring each node with a color
different from the colors of all of its neighbors.

Theorem 8 The algorithm Poly-Log-Color colors a graph with a maximum degree
of Δ with Δ + 1 colors in O(Δ^2 lg n) time.

Proof: Each time the graph is partitioned into two subgraphs with approximately
equal numbers of nodes and therefore the depth of recursion is O(lg n). At each
recursion level we iterate through all the colors, each iteration dominated by the
time to find a color different from the colors of all the neighbors of a node, which
takes O(Δ) time. Hence the total time is O(Δ^2 lg n) on an EREW PRAM. ∎

After coloring the graph in Δ + 1 colors we can construct an MIS of the graph in
O(Δ^2) time. Hence, an MIS of a graph with a polylogarithmic maximum degree can
be found in O(Δ^2 lg n) time on an EREW PRAM using a linear number of processors.
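The divide-and-recolor scheme of this section can be sketched as a sequential recursion. The function name `poly_log_color`, the splitting of the node list into two halves by position, and the adjacency-map representation are assumptions of this sketch; each color class is processed by computing all new colors before writing them, which imitates the parallel step.

```python
def poly_log_color(nodes, adj, max_degree):
    """Color the subgraph induced by `nodes` with colors 1..max_degree+1.

    adj maps each node of the whole graph to the set of its neighbors.
    """
    nodes = list(nodes)
    if len(nodes) <= 1:
        return {v: 1 for v in nodes}
    half = len(nodes) // 2
    left, right = nodes[:half], nodes[half:]
    c = poly_log_color(left, adj, max_degree)
    c.update(poly_log_color(right, adj, max_degree))

    right_set = set(right)
    old = {v: c[v] for v in left}
    palette = set(range(1, max_degree + 2))
    for j in range(1, max_degree + 2):
        picks = {}
        for v in left:
            # Left nodes of class j that clash with a right neighbor are recolored.
            if old[v] == j and any(w in right_set and c[w] == j for w in adj[v]):
                used = {c[w] for w in adj[v] if w in c}  # neighbors inside the subgraph
                picks[v] = max(palette - used)
        c.update(picks)
    return c
```

Only nodes of the left half change color, and nodes sharing an old color in that half are independent, so the simultaneous choices within one class cannot clash.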


7  Lower  Bounds 

In this section we prove two lower bounds for a CRCW PRAM with a polynomial
number of processors:

•  Finding an MIS in a general graph takes Ω(lg n / lg lg n) time.
PROCEDURE Poly-Log-Color(V, E)
    partition V into V_r, V_l such that V_r ∪ V_l = V
    E_r ← {(v, w) | (v, w) ∈ E; v, w ∈ V_r}
    E_l ← {(v, w) | (v, w) ∈ E; v, w ∈ V_l}
    C_r ← Poly-Log-Color(V_r, E_r)
    C_l ← Poly-Log-Color(V_l, E_l)
    V' ← ∅
    for all v ∈ V_l in parallel do
        if ∃(v, w) ∈ E such that v ∈ V_l, w ∈ V_r and C_l(v) = C_r(w)
        then do
            V' ← V' ∪ v
        end
    end
    for j ← 1 to Δ + 1 do
        for all v ∈ V' in parallel do
            if C_l(v) = j
            then do
                C_l(v) ← max{ {1, 2, ..., Δ + 1} − {C(w) | (v, w) ∈ E'} }
                V' ← V' − v
            end
        end
    end


Figure  6:  The  Coloring  Algorithm  for  Polylogarithmic  Maximum  Degree  Graphs 
•  2-coloring a directed list takes Ω(lg n / lg lg n) time.

The first lower bound complements the O(lg n) CRCW PRAM upper bound
for the MIS problem that is achieved by Luby's algorithm [15]. The second lower
bound complements Theorem 2 in this paper.

Theorem 9 The running time of any MIS algorithm on a CRCW PRAM with a
polynomial number of processors is Ω(lg n / lg lg n).

Proof: Given an instance of MAJORITY, we construct an instance of MIS in
constant CRCW PRAM time. MAJORITY is harder than PARITY [6], which was
proven to take Ω(lg n / lg lg n) time on a CRCW PRAM in [2,3]. Therefore the lower
bound claimed in the theorem follows.

Let x_1, x_2, ..., x_n be an instance of MAJORITY. We construct a complete
bipartite graph G = (V, E) with nodes corresponding to '0' bits of the input on one
side and nodes corresponding to '1' bits on the other side.

V = {1, ..., n}

E = {(i, j) | x_i ≠ x_j}

To construct this graph, assign a processor P_ij for each pair 1 ≤ i < j ≤ n. Then,
each processor P_ij writes 1 into location M_ij if x_i ≠ x_j and 0 otherwise.

A maximal matching in a complete bipartite graph is also a maximum one. By
constructing a maximal independent set in the line-graph G' of G, one can find a
maximal matching in G. To construct the graph G' assign a processor P_ijk for each
distinct i, j, k ≤ n. Each P_ijk writes 1 into location M_(i,j),(j,k) if M_ij = M_jk = 1 and
0 otherwise.

MAJORITY equals 1 if and only if there is an unmatched node i ∈ G
such that x_i = 1, which can be checked on a CRCW PRAM in constant time. ∎

Theorem 10 The time to 2-color a directed list on a CRCW PRAM with a
polynomial number of processors is Ω(lg n / lg lg n).




Proof: We show a constant time reduction from PARITY to the 2-coloring of a
directed list. First, we show how to construct, in constant time, a directed list with
elements corresponding to all the input bits x_i with value of 1. Let x_1, x_2, ..., x_n
be an instance of PARITY. Associate a processor P_i with each input cell M_i that
initially holds the value of x_i. Associate a set of processors P^i_jk with each index
i, 1 ≤ j ≤ k < i. In one step, each processor P^i_jk reads the value of M_k and, if it
equals 1, writes 1 into M^i_j, effectively computing the OR-function on the input
values x_j, ..., x_{i-1}. Assign a processor P^i_j to each M^i_j. Each processor P^i_j
reads M^i_j and M^i_{j+1} and writes j into M'_i if and only if M^i_j ≠ M^i_{j+1}. It can be seen
that for all 0 < i ≤ n, M'_i holds max{j | j < i, x_j = 1}.

We have constructed a directed list with elements corresponding to all the input
bits x_i with value 1. Assume this list is 2-colored. Then PARITY equals 1 if
and only if both ends of the list have the same color, which can be checked
in constant time. ∎
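
The last step of the reduction can be illustrated sequentially. The sketch below is not the constant-time PRAM construction: it assumes the list of 1-bits has already been built, and it uses an alternating coloring as a stand-in for any valid 2-coloring of that list. The function name is hypothetical.

```python
def parity_via_two_coloring(bits):
    """Recover PARITY from a 2-coloring of the list of 1-bits.

    Sequential illustration only; the proof builds the list and
    checks its ends in constant time on a CRCW PRAM.
    """
    # The list elements are the positions of the input bits equal to 1.
    ones = [i for i, b in enumerate(bits) if b == 1]
    if not ones:
        return 0  # empty list: an even number (zero) of 1's
    # In a valid 2-coloring of a list, adjacent elements get different
    # colors, so the colors alternate along the list.
    color = {node: idx % 2 for idx, node in enumerate(ones)}
    # Both ends share a color exactly when the list has odd length.
    return 1 if color[ones[0]] == color[ones[-1]] else 0
```

For example, the input 1, 0, 1, 1 yields a three-element list whose two ends share a color, so the function reports parity 1.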
Abstract

Due to chip area and pin count constraints, large
concentrator switches sometimes must be partitioned
among several chips. This paper presents designs
for two multichip partial concentrator switches, both
of which follow from a lemma showing that an
ε-nearsorter is also an (n, m, 1 − ε/m) partial
concentrator.

The first switch, based on the Revsort algorithm, is
an (n, m, 1 − O(n^{3/4}/m)) partial concentrator switch
with at most 2√n + ⌈(lg n)/2⌉ data pins per chip,
Θ(√n) chips, and volume Θ(n^{3/2}). A message incurs
3 lg n + O(1) gate delays in passing through the switch.

The second switch, based on Columnsort, is an
(n, m, 1 − O(n^{2−2β}/m)) partial concentrator switch
with Θ(n^β) data pins per chip, Θ(n^{1−β}) chips, and
volume Θ(n^{1+β}), for any 1/2 ≤ β ≤ 1. A message
incurs 4β lg n + O(1) gate delays.

1  Introduction 

The problem of concentrating relatively few signals
on many input lines onto a smaller number of output
lines must be solved in many kinds of communication
networks. In many parallel computing systems,
information is packaged into messages which are routed
among the processors. The switches that route these
messages sometimes require more chip area or input
and output wires than a single chip can supply. This
paper presents two designs for fast multichip partial
concentrator switches suitable for routing bit-serial
messages in a parallel supercomputer. The key lemma
of this paper may be used to justify other partial
concentrator designs.

This research was supported in part by the Defense
Advanced Research Projects Agency under Contract
N00014-80-C-0622 and in part by a National Science
Foundation Fellowship.


An n-by-m perfect concentrator switch has n input
wires X_1, X_2, ..., X_n and m ≤ n output wires
Y_1, Y_2, ..., Y_m. The switch can establish m disjoint
electrical paths from any set of m input wires to the
m output wires. A perfect concentrator switch always
routes as many messages as possible. Specifically,
whenever k out of the n input wires of an n-by-m
perfect concentrator switch carry messages, one of
the following is true:

• If k ≤ m, then an electrical path is established
from each input wire that contains a message to
an output wire.

•  If  k  >  m,  then  each  output  wire  has  an  electrical 
path  established  from  an  input  wire  that  contains 
a  message. 

When k > m, some messages cannot be successfully
routed, in which case we say the switch is congested.
Typical ways of handling unsuccessfully routed
messages in a routing network are to buffer them, to
misroute them, or to simply drop them and rely on a
higher-level acknowledgment protocol to detect this
situation and resend them. The switch designs in this
paper are compatible with any of these congestion
control methods.

One way to create a perfect concentrator switch is
with a hyperconcentrator switch. An n-by-n
hyperconcentrator switch has n input wires X_1, X_2, ..., X_n
and n output wires Y_1, Y_2, ..., Y_n. The switch can
establish disjoint electrical paths from any set of k
input wires, for any 1 ≤ k ≤ n, to the first k output
wires Y_1, Y_2, ..., Y_k. In other words, we route the k
messages to the first k output wires. We can make
any n-by-m perfect concentrator switch from an
n-by-n hyperconcentrator switch by simply choosing the
first m output wires of the hyperconcentrator switch,
Y_1, Y_2, ..., Y_m, as the m output wires of the perfect
concentrator switch.
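
The truncation argument above can be sketched behaviorally. The code below models only which input reaches which output, not the circuit itself; the function names are hypothetical, and preserving the left-to-right order of the valid inputs is an assumption made here for concreteness.

```python
def hyperconcentrate(valid_bits):
    """Model an n-by-n hyperconcentrator switch.

    Returns a list in which entry y is the index of the input wire
    connected to output wire y, or None if output y carries no message.
    Keeping the valid inputs in their original order is an assumption.
    """
    sources = [i for i, v in enumerate(valid_bits) if v == 1]
    # The k valid inputs are routed to the first k outputs.
    return sources + [None] * (len(valid_bits) - len(sources))


def perfect_concentrate(valid_bits, m):
    """Model an n-by-m perfect concentrator by keeping outputs 0..m-1."""
    return hyperconcentrate(valid_bits)[:m]
```

With valid bits 0, 1, 1, 0, 1, 0 and m = 2, the two outputs are both driven (a congested switch); with m = 4, all three messages are routed and one output stays idle.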
An efficient n-by-n hyperconcentrator switch design
is given in [1] and [2]. This switch has a highly
regular layout in both ratioed nMOS and domino CMOS
technologies, and a signal incurs exactly 2 lg n gate
delays through the switch.¹ This switch uses Θ(n^2)
components and has area Θ(n^2).

Partitioning this hyperconcentrator switch among
multiple chips with p pins each requires Ω((n/p)^2)
chips, since each p-pin chip has area O(p^2) and there
are Θ(n^2) components to partition. We may need to
partition the switch for two reasons:

1. The Θ(n^2) area may exceed the available chip
area.

2.  If  the  switch  is  to  be  packaged  by  itself  on  a  chip, 
it  may  require  more  input  and  output  pins  than 
are  provided  by  the  packaging  technology. 

A different hyperconcentrator switch, composed of a
parallel prefix circuit and a butterfly network [1], can
be built in volume Θ(n^{3/2}) with Θ(n lg n) chips and
as few as four data pins per chip, but this switch is
not combinational. Although its sequential control
is not very complex, it is not as simple as that of a
combinational circuit.

Partial concentrator switches, as we shall see in
Sections 4 and 5, can be combinational with relatively
low gate delays. Yet, given chips with p pins, we can
partition n-input partial concentrator switches using
only Θ(n/p) chips. An (n, m, α) partial concentrator
switch has n input wires X_1, X_2, ..., X_n, m ≤ n
output wires Y_1, Y_2, ..., Y_m, and a fraction 0 < α ≤ 1
such that disjoint electrical paths may be established
from any set of k input wires, for any 1 ≤ k ≤ αm, to
k output wires.

A lightly loaded partial concentrator switch is
similar to a perfect concentrator switch. If there are k
messages entering an (n, m, α) partial concentrator
switch, one of the following is true:

• If k ≤ αm, then an electrical path is established
from each input wire that contains a message to
an output wire.

• If k > αm, then at least αm electrical paths are
established from input wires containing messages
to output wires.

We call the fraction α the load ratio. If a partial
concentrator switch is lightly loaded, i.e., the number of
messages entering is at most αm, then all the
messages are routed to output wires.

¹ We use the notation lg n to denote log_2 n.


An (n/α, m/α, α) partial concentrator switch can
be used anywhere an n-by-m perfect concentrator
switch is required. Consider a set of k ≤ m
messages to be routed through an n-by-m perfect
concentrator switch. For the (n/α, m/α, α) partial
concentrator switch, we have that k ≤ m = α · (m/α),
and thus all k messages are routed to output wires.
If there are instead k > m messages to be routed
through the perfect concentrator switch, we have that
k > m = α · (m/α) for the (n/α, m/α, α) partial
concentrator switch, and thus at least m output wires
carry messages. In either case, the partial concentrator
switch performs the same function as the perfect
concentrator switch, at the cost of a 1/α-factor increase
in the number of input and output wires.

In this paper, we show a connection between
nearsorting and partial concentration. We then use this
relationship to design two efficient multichip partial
concentrator switches, both of which use the
hyperconcentrator switch of [1] and [2] as a subcircuit on a
single chip.

The remainder of this paper is organized as
follows. Section 2 covers some basic terminology and
describes the message format upon which the switches
are based. Section 3 defines nearsorting and shows the
relationship between nearsorting and partial
concentration. Section 4 presents a design for a partial
concentrator switch based on the Revsort algorithm for
sorting on a mesh; Section 5 does the same, but based
on the Columnsort algorithm for sorting on a mesh.
Finally, Section 6 contains further remarks about
multichip concentrator switches.

2  Preliminaries 

In  this  section,  we  define  some  basic  terminology  and 
mathematical  conventions  and  present  the  message 
format  assumed  by  the  switch  designs. 

Bit and boolean values are denoted by "1" and "0"
for TRUE and FALSE respectively.

We assume that the switches route bit-serial
messages. Each message is formed by a stream of bits
arriving at a wire at the rate of one bit per clock
cycle. The first bit of each message that arrives at an
input wire is the valid bit, indicating whether
subsequent bits arriving on that wire form a valid message
or an invalid message. The bit sequence following a
valid bit of 1 forms a valid message, which we would
like to be routed from an input wire to an output wire
of the switch. From there it may pass through the
remainder of the routing network. A valid bit of 0
indicates an invalid message, which does not need to
be routed to an output wire.
The valid bits all arrive at the input wires of a
switch during the same clock cycle, which we call
setup. An external control line signals setup.
Message bits entering through input wires at cycles after
setup follow the electrical paths in the switch that are
established during setup.

We shall adopt some notational conventions to ease
the exposition in the remainder of this paper.
Uppercase symbols denote wire names and lowercase
symbols denote integer values. We shall also use
uppercase symbols to denote bit values on the wires they
name when the usage is unambiguous. Wire names
will usually be subscripted.

A sequence of values is sorted if it is in nonincreasing
order. The valid bits output by an n-by-n
hyperconcentrator switch are thus sorted, since if there are
k valid messages, we have

    Y_1, Y_2, ..., Y_k = 1
    Y_{k+1}, Y_{k+2}, ..., Y_n = 0

during setup.

Concentrators were originally presented as graphs
in, for example, [4, 5, 8]. The term "hyperconcentrator"
is due to Valiant. Vertex-disjoint paths from
designated input nodes to designated output nodes are
the concentrator graph counterpart of the
combinational routing paths established during setup in the
concentrator switches of this paper.

3 Nearsorting and Partial Concentration

In this section, we define ε-nearsorting and show its
relationship to partial concentration. The key lemma
proven in this section is used in the next two sections
to justify partial concentrator switch constructions.

A sequence of values is ε-nearsorted if each element
in the sequence is within ε positions of where it
belongs in the fully sorted sequence. For example, the
sequence 5, 3, 6, 1, 4, 2 is 2-nearsorted since each
element is at most two places away from its correct
position in the fully sorted sequence 6, 5, 4, 3, 2, 1. The
value ε need not be a constant; we will usually let ε be
a function of the size of the sequence. A fully sorted
sequence is also 0-nearsorted.
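
The definition can be checked mechanically. The sketch below computes the smallest ε for which a sequence is ε-nearsorted. The function name is hypothetical, and the treatment of repeated values (an element "belongs" anywhere in the block of equal values in the sorted sequence) is an assumption chosen to agree with how the definition is applied to 0/1 sequences later in this section.

```python
def nearsort_distance(seq):
    """Return the smallest eps such that seq is eps-nearsorted.

    Sorted order is nonincreasing. For repeated values, an element is
    considered in place anywhere within the block of equal values in
    the sorted sequence (an assumption of this sketch).
    """
    target = sorted(seq, reverse=True)
    first, last = {}, {}
    for pos, value in enumerate(target):
        first.setdefault(value, pos)  # leftmost sorted position of value
        last[value] = pos             # rightmost sorted position of value
    eps = 0
    for pos, value in enumerate(seq):
        if pos < first[value]:
            eps = max(eps, first[value] - pos)
        elif pos > last[value]:
            eps = max(eps, pos - last[value])
    return eps
```

Applied to the sequence 5, 3, 6, 1, 4, 2 from the text, the function returns 2; applied to any fully sorted sequence, it returns 0.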

Since we are only interested in nearsorting valid
bits, for the remainder of this paper we shall be
concerned only with inputs whose value is either 0 or
1. We say that a sequence of values is clean if they
all have the same value; otherwise the sequence is
dirty. The following lemma describes an ε-nearsorted
sequence of 0's and 1's.
Figure 1: A fully sorted sequence of k 1's and n − k
0's and an ε-nearsorted sequence of the same values. The
ε-nearsorted sequence consists of a clean sequence of at
least k − ε 1's followed by a dirty sequence of at most 2ε
bits followed by a clean sequence of at least n − k − ε 0's.


Lemma 1 A sequence of n bits, containing k 1's and
n − k 0's, is ε-nearsorted if and only if it consists of
a clean sequence of at least k − ε 1's followed by a
dirty sequence of at most 2ε bits followed by a clean
sequence of at least n − k − ε 0's.

Proof (⇒) As shown in Figure 1, a fully sorted
sequence of k 1's and n − k 0's is simply k 1's followed
by n − k 0's. In an ε-nearsorted sequence of the same
values, each 1 appears within the first k + ε positions,
and each 0 appears within the last n − k + ε positions.
The only dirty sequence within the ε-nearsorted
sequence is therefore centered at the kth position and
extends ε positions to either side. The lemma then
follows.

(⇐) Again referring to Figure 1, each 1 is within
the first k + ε positions, and each 0 is within the last
n − k + ε positions. The sequence is thus ε-nearsorted.

□ 

The following lemma is the key lemma that relates
ε-nearsorting to partial concentration.

Lemma 2 Let P be a switch with n inputs
X_1, X_2, ..., X_n and n outputs Y_1, Y_2, ..., Y_n, and
suppose that P ε-nearsorts valid bits. Then by restricting
the outputs of P to Y_1, Y_2, ..., Y_m, for any m ≤ n,
P is an (n, m, α) partial concentrator switch, where
α = 1 − ε/m.


Proof Consider any input to switch P containing k
1's and n − k 0's. We have αm = (1 − ε/m)m = m − ε,
and there are two cases.

Case 1: k ≤ αm = m − ε. We have m ≥ k + ε.
Since P is an ε-nearsorter, each 1 appears within the
outputs {Y_1, Y_2, ..., Y_{k+ε}} ⊆ {Y_1, Y_2, ..., Y_m}. Thus,
each 1 is routed to an output of the partial
concentrator switch.
Figure 2: The output of an (n, m, 1 − ε/m) partial
concentrator switch that is not ε-nearsorted. This switch routes
m − ε out of k > m − ε 1's to the first m outputs, but the
remaining k − m + ε 1's are routed to the last k − m + ε
out of the n outputs. If we have k + ε < n − (k − m + ε),
or equivalently, k + ε < (n + m)/2, then the last k − m + ε
1's are not within ε positions of output Y_k, and thus the
output sequence is not ε-nearsorted.


Case 2: k > αm = m − ε. We have m <
k + ε. Again, each 1 appears within the outputs
{Y_1, Y_2, ..., Y_{k+ε}}. From Lemma 1, we know that
at most ε of the outputs {Y_1, Y_2, ..., Y_{k+ε}} carry 0's,
so at most ε of the outputs {Y_1, Y_2, ..., Y_m} carry
0's. Thus, at least m − ε = αm of the outputs
{Y_1, Y_2, ..., Y_m} carry 1's.

We conclude that by restricting the outputs of P to
Y_1, Y_2, ..., Y_m, P is an (n, m, 1 − ε/m) partial
concentrator switch. □
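
Lemma 2 can be sanity-checked by brute force on small cases. The sketch below treats every n-bit sequence as a possible output of an ε-nearsorter, with ε taken as the smallest value for which the sequence is ε-nearsorted, and checks that the first m outputs carry at least min(k, m − ε) 1's. The function names are hypothetical, and the check is an illustration rather than a proof.

```python
from itertools import product


def bit_nearsort_distance(bits):
    """Smallest eps such that a 0/1 sequence is eps-nearsorted."""
    k = sum(bits)
    eps = 0
    for pos, b in enumerate(bits):
        if b == 1 and pos >= k:
            # A 1 belongs in positions 0..k-1 of the sorted sequence.
            eps = max(eps, pos - (k - 1))
        elif b == 0 and pos < k:
            # A 0 belongs in positions k..n-1 of the sorted sequence.
            eps = max(eps, k - pos)
    return eps


def lemma2_holds(n):
    """Check the conclusion of Lemma 2 for every n-bit output and every m."""
    for bits in product((0, 1), repeat=n):
        k = sum(bits)
        eps = bit_nearsort_distance(bits)
        for m in range(1, n + 1):
            routed = sum(bits[:m])  # 1's reaching the first m outputs
            if routed < min(k, m - eps):
                return False
    return True
```

Calling `lemma2_holds(n)` for small n (for example n = 6) exercises both cases of the proof, since every value of k and m is enumerated.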

The converse of Lemma 2 is not necessarily true.
As shown in Figure 2, if an (n, m, 1 − ε/m) partial
concentrator switch routes m − ε out of k > αm =
m − ε 1's to the first m outputs, the remaining k − m + ε
1's may be routed to the last k − m + ε out of the n
outputs. In this case, if there are more than ε outputs
between Y_k and Y_{n−(k−m+ε)}, then the output sequence
is not ε-nearsorted.

4 A Revsort-Based Partial Concentrator Switch

In this section, we present a design for an (n, m, α)
partial concentrator switch that uses Θ(√n) chips
with only Θ(√n) data pins each. The basic building
block is the hyperconcentrator switch of [1] and [2]
placed on a chip. Each message incurs 3 lg n + O(1)
gate delays in passing through the switch. The load
ratio is α = 1 − O(n^{3/4}/m). Most of the results of this
section originally appeared in [1].

This partial concentrator switch can be
implemented in

• two dimensions with Θ(n^2) area and one chip
type with 2√n data pins, or

• three dimensions with Θ(n^{3/2}) volume, two chip
types with at most 2√n + ⌈(lg n)/2⌉ pins, and two
board types.

The design is based on Schnorr and Shamir's
Revsort algorithm for sorting on a mesh [7], which,
although not optimal for sorting on a mesh, is
simple. The idea behind the partial concentrator switch
is to nearsort a √n-by-√n matrix of valid bits. The
m output wires of the switch correspond to the first
m nearsorted matrix entries.

We need some basic definitions. We assume that the
rows and columns of the √n × √n matrix are
numbered 0, 1, ..., √n − 1 and that √n = 2^q for some
integer q. We also define, for any integer i, 0 ≤ i < √n,
rev(i) to be the binary number obtained by reversing
the q bits in the binary representation of i, including
the leading zeros. For example, when √n = 16, rev(3)
is 12.
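
The bit-reversal function is easy to state in code. The following is a minimal sketch; passing the bit width q as an explicit second parameter is a choice made here, since the text fixes q implicitly through √n = 2^q.

```python
def rev(i, q):
    """Reverse the q-bit binary representation of i (leading zeros included)."""
    result = 0
    for _ in range(q):
        result = (result << 1) | (i & 1)  # append the lowest bit of i
        i >>= 1
    return result
```

With q = 4 (that is, √n = 16), `rev(3, 4)` reverses 0011 into 1100, which is 12, matching the example in the text.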

The partial concentrator switch is built from three
stages, each stage containing √n hyperconcentrator
chips. Each √n-by-√n hyperconcentrator chip serves
to fully sort a row or column of valid bits in the
underlying matrix. We shall denote by H_{l,i} the ith
hyperconcentrator chip in stage l, for 1 ≤ l ≤ 3 and 0 ≤
i < √n, with input wires X_{l,i,0}, X_{l,i,1}, ..., X_{l,i,√n−1}
and output wires Y_{l,i,0}, Y_{l,i,1}, ..., Y_{l,i,√n−1}.

The general idea of the construction of the partial
concentrator switch is as follows. Each stage 1 chip
corresponds to a column of the matrix, so the stage 1
chips fully sort the valid bits in each column. The
input and output wires X_{1,j,i} and Y_{1,j,i} represent the
value of the matrix element at row i and column j
before and after sorting.

The wiring between stages 1 and 2 is effectively a
matrix transposition, accomplished by connecting the
output wire Y_{1,j,i} to the input wire X_{2,i,j} for 0 ≤ i, j <
√n. Each stage 2 chip then corresponds to a row of
the matrix, so the stage 2 chips fully sort the valid
bits in each row. The input and output wires X_{2,i,j}
and Y_{2,i,j} represent the value of the matrix element at
row i and column j before and after sorting.

The wiring between stages 2 and 3 is the
composition of two matrix permutations. We first
cyclically rotate row i by rev(i) places to the right, for
0 ≤ i < √n. That is, the matrix element in row i
and column j, for 0 ≤ i, j < √n, is moved to row i
and column (rev(i) + j) mod √n. The matrix is then
transposed. Each stage 3 chip then corresponds to a
column of the matrix, so the stage 3 chips fully sort the
valid bits in each column. The two permutations are
accomplished in one wiring step by connecting the
output wire Y_{2,i,j} to the input wire X_{3,(rev(i)+j) mod √n,i},
for 0 ≤ i, j < √n.


The output wires of the partial concentrator switch
are the first m output wires of the matrix in row-major
order, or Y_{3,j,i} for 0 ≤ i < ⌊m/√n⌋ and 0 ≤ j < √n,
or i = ⌊m/√n⌋ and 0 ≤ j < m mod √n.

Like the hyperconcentrator chips from which it is
built, the partial concentrator switch is a
combinational circuit. The routing paths are established by
the valid bits during setup, and subsequent bits
follow along these paths.

To see that this construction does indeed yield an
(n, m, 1 − O(n^{3/4}/m)) partial concentrator switch, we
first observe that its operation is equivalent to the
following algorithm, which corresponds to the first 1½
iterations of Revsort:

Algorithm 1 Given a √n × √n matrix with √n =
2^q and matrix element values of 0 or 1, perform the
following four steps:

1.  Fully  sort  the  columns. 

2.  Fully  sort  the  rows. 

3. For 0 ≤ i < √n, cyclically rotate row i by rev(i)
places to the right, i.e., move the element in
column j to column (rev(i) + j) mod √n.

4.  Fully  sort  the  columns. 

The three sorting steps correspond to the three stages
of hyperconcentrator chips in the partial
concentrator switch construction. The wiring between stages 1
and 2 corresponds to changing from sorting columns
to sorting rows. The wiring between stages 2 and 3
corresponds to the cyclic rotations within rows and
changing from sorting rows to sorting columns. We
are now ready to prove that this construction works.

Theorem 3 The Revsort-based construction yields
an (n, m, 1 − O(n^{3/4}/m)) partial concentrator switch.

Proof Both [1] and [7] show that after running
Algorithm 1 on a √n × √n matrix with elements
valued 0 or 1, the matrix consists of only clean rows
of 1's at the top, clean rows of 0's at the bottom,
and at most 2⌈n^{1/4}⌉ − 1 dirty rows in the middle.
Since each row contains √n elements, there are at
most O(n^{3/4}) dirty bits. By Lemma 1, the sequence
is O(n^{3/4})-nearsorted, and by Lemma 2, the circuit is
an (n, m, 1 − O(n^{3/4}/m)) partial concentrator switch.

□ 
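
Algorithm 1 can be simulated directly on a 0/1 matrix. The sketch below is a software model of the four steps, not of the chips or wiring; all function names are hypothetical, and sorting is into nonincreasing order so that 1's move to the top of columns and to the left of rows.

```python
def rev(i, q):
    """Reverse the q-bit binary representation of i."""
    out = 0
    for _ in range(q):
        out = (out << 1) | (i & 1)
        i >>= 1
    return out


def sort_columns(mat):
    """Sort every column in place into nonincreasing order (1's on top)."""
    side = len(mat)
    for j in range(side):
        col = sorted((mat[i][j] for i in range(side)), reverse=True)
        for i in range(side):
            mat[i][j] = col[i]


def algorithm1(matrix, q):
    """Run the four steps of Algorithm 1 on a copy of a 2^q-by-2^q matrix."""
    side = 1 << q
    mat = [list(row) for row in matrix]
    sort_columns(mat)                    # step 1: sort columns
    for row in mat:                      # step 2: sort rows
        row.sort(reverse=True)
    for i in range(side):                # step 3: rotate row i by rev(i)
        shift = rev(i, q)
        rotated = [0] * side
        for j in range(side):
            rotated[(shift + j) % side] = mat[i][j]
        mat[i] = rotated
    sort_columns(mat)                    # step 4: sort columns
    return mat


def dirty_rows(mat):
    """Count rows containing both 0's and 1's."""
    return sum(1 for row in mat if 0 < sum(row) < len(row))
```

As a small hand-checked case with q = 2, the rows 1111, 1100, 1100, 0000 become 1111, 0111, 0010, 0000, which has two dirty rows; the count from `dirty_rows` can be compared against the bound quoted in the proof.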

Figure 3 shows a two-dimensional layout of the
switch using 3√n hyperconcentrator chips, with 2√n
data pins each. We simply use crossbar wiring to
permute the wires between hyperconcentrator chips


Figure 3: A two-dimensional layout of the Revsort-based
partial concentrator switch with n = 64 inputs and m = 28
outputs. The electrical paths established by 24 valid
messages are shown with heavy lines. The output wires
are the top four output wires of hyperconcentrator chips
H_{3,0}, H_{3,1}, H_{3,2}, H_{3,3} and the top three output wires of
hyperconcentrator chips H_{3,4}, H_{3,5}, H_{3,6}, H_{3,7}.

of consecutive stages. The area of this layout is Θ(n^2)
since the crossbar wiring area is Θ(n^2), which
dominates the total chip area of Θ(n^{3/2}). (Each stage
of √n-by-√n hyperconcentrator chips consists of √n
chips, each with area Θ(n), for a total chip area of
Θ(n^{3/2}).)

A signal incurs 2⌈lg √n⌉ + O(1) gate delays in
passing through each chip. The 2⌈lg √n⌉ gate delays are
from the hyperconcentrator switch within the chip.
The I/O pad circuitry accounts for the additional O(1)
delay. The total number of gate delays incurred by a
signal passing through the entire partial concentrator
switch is thus

    6⌈lg √n⌉ + O(1) ≤ 6 lg √n + O(1)
                     = 3 lg n + O(1) .

As shown in Figure 4, we can package the partial
concentrator switch in three dimensions using volume
Θ(n^{3/2}). Each circuit board contains one √n-by-√n
hyperconcentrator chip, corresponding to one row or

Figure 4: The three-dimensional packaging of the Revsort-based partial concentrator switch for n = 64. Each stack
contains √n circuit boards and corresponds to one stage. Each board contains one √n-by-√n hyperconcentrator chip,
and boards in stack 2 follow the hyperconcentrator chip by a √n-bit barrel shifter chip to perform the cyclic rotation of
each row. The lg √n control bits that determine the shift amount for each barrel shifter are hard-wired.


column of the matrix. Each of the three stacks
contains √n boards and represents one stage. The wires
cross stack junctions in a √n × √n array, with the
valid bit value of the wire in row i and column j equal
to the value of the matrix element in the same position
at the corresponding step of Algorithm 1.

The matrix transpose between stages 1 and 2 is
performed in the natural way, with the ith output wire
from board j in stage 1 going straight across the
junction to be the jth input wire of board i in stage 2.
The wiring permutation between the hyperconcentrator
chips of stages 2 and 3 includes the cyclic rotations
of the rows, followed by the transpose. The transpose
is performed in the natural way once again. We
perform the cyclic rotation by following each stage 2
hyperconcentrator chip by a √n-bit barrel shifter on the
same board. The barrel shifter has √n input wires,
√n output wires, and ⌈lg √n⌉ control bits which,
interpreted as a binary integer, determine the rotation
amount. We hardwire the control bits in the ith board
to have the value rev(i).

We use only two board types, 3√n
hyperconcentrator chips, and √n barrel shifters in building the
switch. All 2√n boards in stages 1 and 3 are
identical, as are all √n stage 2 boards. The barrel shifters
require 2√n + ⌈lg √n⌉ = 2√n + ⌈(lg n)/2⌉ data pins.
The hardwiring of the barrel shifter control bit values
can be performed after the boards have been
fabricated.

To see that the volume is Θ(n^{3/2}), we need only
consider the stage 2 stack, which has the most
components. Each board contains a √n-by-√n
hyperconcentrator chip and a √n-bit barrel shifter, both
having area Θ(n). The whole stack of √n boards, and
therefore the entire switch, has volume Θ(n^{3/2}).

Since the barrel shift amounts are hardwired and
never change, the barrel shifters introduce only a
constant number of gate delays. A signal therefore incurs
3 lg n + O(1) gate delays in passing through the
three-dimensional switch.

Letting p, the number of pins per chip, be Θ(√n),
both the two-dimensional and three-dimensional
layouts use only Θ(n/p) chips.

5 A Columnsort-Based Partial Concentrator Switch

In this section, we present a design for an (n, m, α)
partial concentrator switch that uses Θ(n^{1−β}) chips
with Θ(n^β) pins each, where 1/2 ≤ β ≤ 1. The
basic building block is a Θ(n^β)-by-Θ(n^β)
hyperconcentrator chip. Each message incurs 4β lg n + O(1) gate
delays in passing through the switch. The load ratio
is α = 1 − O(n^{2−2β}/m). This switch can be
implemented in two dimensions with area Θ(n^2) or in three
dimensions with volume Θ(n^{1+β}). Table 1 shows
resource measures for the Revsort-based switch and the
values of β at which the switch of this section matches
them asymptotically.

The design is based on Leighton's Columnsort
algorithm [3] for sorting n elements on an r × s mesh,
where n = rs and s evenly divides r. The idea behind
this partial concentrator switch is to (s − 1)^2-nearsort
an r × s matrix of valid bits. As with the switch of
                 Revsort            Columnsort,        Columnsort,        Columnsort,
                                    β = 1/2            β = 5/8            β = 3/4

pins per chip    Θ(n^{1/2})         Θ(n^{1/2})         Θ(n^{5/8})         Θ(n^{3/4})
chip count       Θ(n^{1/2})         Θ(n^{1/2})         Θ(n^{3/8})         Θ(n^{1/4})
load ratio       1 − O(n^{3/4}/m)   1 − O(n/m)         1 − O(n^{3/4}/m)   1 − O(n^{1/2}/m)
gate delays      3 lg n + O(1)      2 lg n + O(1)      (5/2) lg n + O(1)  3 lg n + O(1)
volume           Θ(n^{3/2})         Θ(n^{3/2})         Θ(n^{13/8})        Θ(n^{7/4})

Table 1: Resource measures for the Revsort-based partial concentrator switch and the values of β at which the
Columnsort-based switch matches them asymptotically.
Figure 5: Row-major and column-major positions of
elements in a 6 × 3 matrix.

the  previous  section,  the  m  output  wires  of  the  switch 
correspond  to  the  first  m  matrix  entries. 

We may identify a matrix entry by either its row
and column position or by its position in row-major
or column-major order. All numbering starts at 0.
Thus, the rows are numbered 0, 1, ..., r − 1 and the
columns are numbered 0, 1, ..., s − 1. The row-major
position of the matrix entry in row i and column j
is RM(i, j) = si + j, and its column-major position
is CM(i, j) = rj + i. For example, Figure 5 shows
the row-major and column-major positions of a 6 × 3
matrix. We have that 0 ≤ RM(i, j), CM(i, j) <
n. The row and column position corresponding to
the entry in row-major position x is RM^{-1}(x) =
(⌊x/s⌋, x mod s).
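
These index maps translate directly into code. The following is a minimal sketch with hypothetical function names; the matrix dimensions r and s are passed explicitly as parameters.

```python
def rm(i, j, s):
    """Row-major position of the entry in row i, column j (s columns)."""
    return s * i + j


def cm(i, j, r):
    """Column-major position of the entry in row i, column j (r rows)."""
    return r * j + i


def rm_inv(x, s):
    """Row and column of the entry at row-major position x."""
    return divmod(x, s)  # (floor(x / s), x mod s)
```

For the 6 × 3 matrix of Figure 5, the entry in row 1 and column 2 has row-major position `rm(1, 2, 3)` = 5 and column-major position `cm(1, 2, 6)` = 13, and `rm_inv(13, 3)` gives row 4, column 1.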

The partial concentrator switch is built from two
stages, each stage containing s hyperconcentrator
chips. Since the hyperconcentrator chips are
combinational, so is the partial concentrator switch. Each
r-by-r hyperconcentrator chip corresponds to a
column of the underlying matrix, fully sorting the
column. We shall denote by H_{l,j} the jth
hyperconcentrator chip in stage l, for l = 1, 2 and 0 ≤ j < s, with
input wires X_{l,j,0}, X_{l,j,1}, ..., X_{l,j,r−1} and output wires
Y_{l,j,0}, Y_{l,j,1}, ..., Y_{l,j,r−1}. Wires X_{l,j,i} and Y_{l,j,i}
correspond to the matrix element in row i and column j.

The wiring between stages 1 and 2 corresponds
to converting the matrix from column-major to
row-major ordering, using the composition of functions
RM^{-1} ∘ CM. We connect the output wire Y_{1,j,i} to
the input wire X_{2,(rj+i) mod s,⌊(rj+i)/s⌋}, for 0 ≤ i < r
and 0 ≤ j < s.

Once again, the output wires of the partial
concentrator switch are the first m output wires of the
matrix in row-major order. We use wires Y_{2,j,i} for
0 ≤ i < ⌊m/s⌋ and 0 ≤ j < s, or i = ⌊m/s⌋ and
0 ≤ j < m mod s.

To show that this circuit (s − 1)^2-nearsorts the valid
bits, we first observe that its operation is equivalent
to the following algorithm, which corresponds to the
first three steps of Columnsort:

Algorithm 2 Given an r × s matrix of n elements,
where n = rs, and matrix values of 0 or 1, perform
the following three steps:

1.  Fully  sort  the  columns. 

2. Convert the matrix from column-major to
row-major order, i.e., move the element in row i and
column j to row ⌊(rj + i)/s⌋ and column (rj +
i) mod s.

3.  Fully  sort  the  columns. 

The two stages of hyperconcentrator chips correspond
to steps 1 and 3, and the wiring between the stages
corresponds to step 2. This correspondence between
the circuit and Columnsort allows us to prove the
following theorem.

Theorem 4 The Columnsort-based construction yields an $(n, m, 1 - (s-1)^2/m)$ partial concentrator switch.

Proof Leighton shows in [3] that Algorithm 2 is an $(s-1)^2$-nearsorter when the matrix elements are taken in row-major order. By Lemma 2, the circuit is an $(n, m, 1 - (s-1)^2/m)$ partial concentrator switch when the outputs are taken in row-major order. □
Figure 6: A two-dimensional layout of the Columnsort-based partial concentrator switch with $n = 32$ inputs and $m = 18$ outputs. The underlying matrix is $8 \times 4$. The electrical paths established by 14 valid messages are shown with heavy lines. The output wires are the first five output wires of hyperconcentrator chips $H_{2,0}$ and $H_{2,1}$ and the first four output wires of hyperconcentrator chips $H_{2,2}$ and $H_{2,3}$.


To achieve the results stated at the beginning of this section, we let $r = \Theta(n^{\beta})$ and $s = \Theta(n^{1-\beta})$. To ensure that $n = rs$ and that $s$ divides $r$ as $n$ increases, we require that $1/2 \le \beta \le 1$. The load ratio is then

$$\alpha = 1 - \frac{(s-1)^2}{m} = 1 - O\!\left(\frac{n^{2-2\beta}}{m}\right).$$

The number of chips is $2s = \Theta(n^{1-\beta})$, and each chip requires $2r = \Theta(n^{\beta})$ data pins.

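The delay through the switch is $2 \cdot 2 \lg r + O(1) = 4 \lg r + O(1)$. Letting $r \le c n^{\beta} + o(n^{\beta})$ for some constant $c$, we have that the delay is

$$
\begin{aligned}
4 \lg r + O(1) &\le 4 \lg\bigl(c n^{\beta} + o(n^{\beta})\bigr) + O(1) \\
&\le 4 \lg\bigl((c+1) n^{\beta}\bigr) + O(1) \quad \text{(for sufficiently large $n$)} \\
&= 4\beta \lg n + 4 \lg(c+1) + O(1) \\
&= 4\beta \lg n + O(1).
\end{aligned}
$$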
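For a concrete feel for these quantities, the following sketch evaluates them for given $r$, $s$, and $m$. The function name and the returned keys are hypothetical, and the delay entry covers only the leading $4 \lg r$ term, ignoring the $O(1)$ additive constant.

```python
import math


def columnsort_switch_parameters(r, s, m):
    """Evaluate the stated formulas for an r x s underlying matrix and
    m outputs. Illustrative only; names are not from the paper."""
    if r % s != 0:
        raise ValueError("s must divide r")
    return {
        "n": r * s,                               # number of inputs
        "load_ratio": 1 - (s - 1) ** 2 / m,       # 1 - (s-1)^2 / m
        "chips": 2 * s,                           # two stages of s chips
        "data_pins_per_chip": 2 * r,              # r inputs plus r outputs
        "delay_leading_term": 4 * math.log2(r),   # 4 lg r, without the O(1)
    }
```

For the switch of Figure 6 (r = 8, s = 4, m = 18) this gives n = 32, a guaranteed load ratio of 0.5, eight chips, 16 data pins per chip, and a leading delay term of 12 gate delays.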


A two-dimensional layout using $O(n^2)$ area is shown in Figure 6. As in the Revsort-based switch, we use $n \times n$ crossbar wiring to connect the stages.

Figure 7 shows a three-dimensional packaging of the switch using volume $\Theta(r^2 s) = \Theta(n^{1+\beta})$. As in Figure 6, we have $r = 8$ and $s = 4$. There are two stacks of boards, with each stack containing $s$ boards and corresponding to one stage of hyperconcentrator chips, and each board containing one r-by-r hyperconcentrator chip.

Figure 7: The three-dimensional packaging of the Columnsort-based partial concentrator switch for $r = 8$ and $s = 4$. Each stack contains $s$ chips, each of which is an r-by-r hyperconcentrator. The wiring between the stages of chips performs the $RM^{-1} \circ CM$ permutation. The interstack connectors transpose the wires from vertical to horizontal alignment.

Figure 8: The transposition of $w$ wires from vertical to horizontal alignment, shown for $w = 4$, using volume $\Theta(w^2)$.

The tricky part of this construction is the wiring between stages, which must perform the permutation $RM^{-1} \circ CM$. On the first stack, we group together output wires whose column-major numberings are congruent modulo $s$, or equivalently, those whose row numbers are congruent modulo $s$. Each such group contains $r/s$ wires. In Figure 7, for example, since we have $s = 4$, we group together wires $Y_{1,0,0}$ and $Y_{1,0,4}$, $Y_{1,0,1}$ and $Y_{1,0,5}$, $Y_{1,0,2}$ and $Y_{1,0,6}$, $Y_{1,0,3}$ and $Y_{1,0,7}$, etc. In order to allow them to enter the stage 2 chips, these wires are then "transposed" in small interstack connectors to align them horizontally instead of vertically. Figure 8 shows one way to transpose a group of $r/s$ wires in volume $O((r/s)^2)$.
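The grouping rule is easy to reproduce. The sketch below, with a hypothetical function name, groups the output wires of one stage-1 chip by column-major position modulo $s$; because $s$ divides $r$, this is the same as grouping by row number modulo $s$, and the group index is also the stage-2 chip that the wires enter.

```python
def interstack_groups(r, s, j):
    """Group the output wire indices i of stage-1 chip j by the residue
    of their column-major position r*j + i modulo s.

    groups[g] lists the wires of chip j that go to stage-2 chip g.
    """
    groups = [[] for _ in range(s)]
    for i in range(r):
        groups[(r * j + i) % s].append(i)
    return groups
```

For r = 8 and s = 4, chip 0 yields the groups {0, 4}, {1, 5}, {2, 6}, and {3, 7}, each of size r/s = 2.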

The first stack dominates the volume of this construction. We have $s$ boards, and each board contains a $\Theta(r^2)$-area hyperconcentrator chip and an $O(r^2)$-area wiring permutation. The total volume of each stack is thus $\Theta(r^2 s) = \Theta(n^{1+\beta})$. There are $s^2$ interstack connectors, each with volume $O((r/s)^2)$, for a total interstack volume of $O(r^2) = O(n^{2\beta})$. Since we have $\beta \le 1$, the total interstack volume is $O(n^{1+\beta})$. The total volume of the partial concentrator switch is thus $\Theta(n^{1+\beta})$.

For both the two-dimensional and three-dimensional layouts, letting $p$, the number of pins per chip, be $\Theta(r)$, we use only $\Theta(s) = \Theta(n/p)$ chips. The three-dimensional layout, however, uses $s^2 = \Theta((n/p)^2)$ interstack connectors, but these connectors contain only wiring and no active components.

6  Concluding  Remarks 

In  this  section,  we  briefly  discuss  the  characteristics 
of  the  partial  concentrator  switches  we  have  seen  and 
then  discuss  multichip  hyperconcentrator  switches. 
Finally,  we  pose  some  open  questions. 

Both of the partial concentrator switches we have examined are efficient in that they are relatively fast and can be packaged with a relatively low volume. They also allow air to flow through in all three dimensions and may thus be air-cooled.

The $\beta$ parameter of the Columnsort-based switch defines a tradeoff continuum for the characteristics of the switch. As evidenced by Table 1, as the value of $\beta$ increases, so do the number of pins per chip, delay, and volume, but the load ratio improves and the number of chips decreases.

Rather  than  simulating  just  the  first  steps  of 
Revsort  and  Columnsort,  one  could  simulate  the  full 
algorithms  to  fully  sort  the  valid  bits  and  thus  build 
multichip  hyperconcentrator  switches.  Compared  to 
the  partial  concentrator  switches  presented  above, 
such  hyperconcentrator  switches  have  increased  delay, 
and  a  Revsort-based  hyperconcentrator  switch  has  a 
greater  chip  count  and  asymptotic  volume  than  its 
partial  concentrator  counterpart. 

Schnorr and Shamir show in [7] that if steps 1-3 of Algorithm 1 are repeated $\lceil \lg \lg \sqrt{n} \rceil$ times, the resulting matrix contains at most eight dirty rows. We can then complete the full sorting by running three iterations of the Shearsort algorithm [6]. An n-by-n hyperconcentrator switch based on the full Revsort algorithm consists of $\lceil \lg \lg \sqrt{n} \rceil$ repetitions of stacks 1 and 2 of Figure 4 followed by three pairs of different stacks that simulate Shearsort. (Each Shearsort stack consists of $\sqrt{n}$ boards, each of which contains a $\sqrt{n}$-by-$\sqrt{n}$ hyperconcentrator chip and fixed permutation wiring.) A signal passes through $2 \lg \lg n + 4$ hyperconcentrator chips in such an n-by-n hyperconcentrator switch, incurring $4 \lg n \lg \lg n + 8 \lg n + O(\lg \lg n)$ gate delays. The switch uses a total of $\Theta(\sqrt{n} \lg \lg n)$ chips in volume $O(n^{3/2} \lg \lg n)$.
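The count of chips on a signal path follows from the structure just described: two chips per repetition of stacks 1 and 2, plus two chips for each of the three pairs of Shearsort stacks. A minimal sketch, assuming $n$ is a power of two with $\lg n \ge 4$ and using a hypothetical function name:

```python
import math


def revsort_chips_on_path(n):
    """Hyperconcentrator chips a signal passes through in the full
    Revsort-based switch, counted from the description in the text."""
    repetitions = math.ceil(math.log2(0.5 * math.log2(n)))  # ceil(lg lg sqrt(n))
    shearsort_pairs = 3
    return 2 * repetitions + 2 * shearsort_pairs
```

When $\lg \lg n$ is an integer, this agrees with the closed form $2 \lg \lg n + 4$, since $\lg \lg \sqrt{n} = \lg \lg n - 1$.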

Similarly, by simulating all eight steps of Columnsort, we can build a hyperconcentrator switch with the same asymptotic volume and chip count as the partial concentrator switch of Section 5. A signal passes through four chips and incurs $8\beta \lg n + O(1)$ gate delays through such an n-by-n hyperconcentrator switch.

Rather than wondering how fast a multichip hyperconcentrator switch we can build, we might ask for what functions $f(p)$ can we build a $(\Theta(f(p)), m, 1 - o(p/m))$ partial concentrator switch, given chips with $p$ pins and using only two stages of chips. The Columnsort-based construction, for example, gives us $f(p) = p^{2-\epsilon}$ for any $0 < \epsilon \le 1$. Can we achieve $f(p) = \Theta(p^2)$? In general, how large a function $f(p)$ can we achieve with $k$ stages?
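The exponent in the Columnsort-based example can be traced back to the parameters of Section 5. The following is a short sketch of that step, assuming $p = \Theta(r)$ pins per chip and $r = \Theta(n^{\beta})$ as above, and leaving the load-ratio condition aside:

```latex
\[
  p = \Theta(r) = \Theta(n^{\beta})
  \quad\Longrightarrow\quad
  n = \Theta\bigl(p^{1/\beta}\bigr),
  \qquad\text{so}\qquad
  f(p) = p^{2-\epsilon}
  \ \text{ with }\
  \epsilon = 2 - \frac{1}{\beta}.
\]
```

Under this reading, values of $\beta$ close to $1/2$ correspond to $\epsilon$ close to 0, and $\beta = 1$ corresponds to $\epsilon = 1$.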

There may be $\epsilon$-nearsorters based on networks other than the two-dimensional mesh to which we can apply Lemma 2. What types of partial concentrator switches can we build by applying Lemma 2 to other $\epsilon$-nearsorters?
Abstract

In this thesis we study graph algorithms, both in sequential and parallel contexts. In the following outline of the thesis, algorithm complexities are stated in terms of the number of vertices $n$, the number of edges $m$, the largest absolute value of capacities $U$, and the largest absolute value of costs $C$.

In Chapter 1 we introduce a new approach to the maximum flow problem that leads to better algorithms for the problem. These algorithms include an $O(nm \log(n^2/m))$ time sequential algorithm, an $O(n^2 \log n)$ time parallel algorithm that uses $O(n)$ processors and $O(m)$ memory, and both synchronous and asynchronous distributed algorithms.

Chapter 2 is devoted to the minimum-cost flow problem, which is a generalization of the maximum flow problem. We introduce a framework that allows the generalization of the maximum flow techniques to the minimum-cost flow problem. This framework allows us to design efficient algorithms for the minimum-cost flow problem. We exhibit $O(nm \log(n) \log(nC))$, $O(n^{5/3} m^{2/3} \log(nC))$, and $O(n^3 \log(nC))$ time sequential algorithms as well as parallel and distributed algorithms.

In  Chapter  3  we  address  implementation  of  parallel  algorithms  through  a  case-study  of 
an  implementation  of  a  parallel  maximum  flow  algorithm.  Parallel  prefix  operations  play 
an  important  role  in  our  implementation.  We  present  experimental  results  achieved  by  the 
implementation. 

Parallel symmetry-breaking techniques are the main topic of Chapter 4. We give an $O(\lg^{*} n)$ algorithm for 3-coloring a rooted tree. This algorithm is used to improve several parallel algorithms, including algorithms for $(\Delta + 1)$-coloring and finding a maximal independent set in constant-degree graphs, 5-coloring planar graphs, and finding a maximal matching in planar graphs. We also prove lower bounds on the parallel complexity of the maximal independent set problem and the problem of 2-coloring a rooted tree.
ABSTRACT

This paper considers the problem of maximizing the energy or average power transfer from a nonlinear dynamic source. The main theorem includes as special cases the standard linear result $Y_{\text{load}} = Y^{*}_{\text{source}}$ and a recent finding for nonlinear resistive networks. An operator equation for the optimal output voltage $v(\cdot)$ is derived, and a numerical method for solving it is given.